katago-server's Introduction

katago-server

Collaborative server for Katago

Built with Cookiecutter Django. Code style: Black. License: MIT.

Installation and Setup

After setting up KataGo, follow these steps on Ubuntu:

sudo apt install docker docker-compose

sudo systemctl enable docker

sudo usermod -aG docker INSERT_YOUR_USERNAME

Then reboot.

Starting/Stopping Docker (for a Local Test Server)

docker-compose -f local.yml build

docker-compose -f local.yml up # start up and output logs to current shell

docker-compose -f local.yml up -d # daemonize rather than attaching to the current shell

To stop it, press Ctrl-C. Or, if it was started daemonized, or you want to stop it from another shell:

docker-compose -f local.yml down

You should be able to see your server in a browser at something like http://localhost:3000

Running the tests (for a Local Test Server)

Using the local server you set up, this command will run some tests:

docker-compose -f local.yml run --rm django pytest -vv

Or you can do this if you also want each test's stdout printed, which can be useful for debugging a test itself:

docker-compose -f local.yml run --rm django pytest -vvs

Create an initial admin user for the website

docker-compose -f local.yml run --rm django python manage.py createsuperuser

Once you have an initial admin user, you should be able to visit http://localhost:3000/admin and create a Run, along with an initial random network for the run (or a non-random network, if you want to start with a network trained elsewhere).

You will also want to immediately create periodic jobs for updating the Bayesian Elo and for refreshing the materialized views that store stats about uploaded games and data, setting them to run every few minutes.
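If the periodic jobs are driven by django-celery-beat (as cookiecutter-django projects with celery typically are), a job can also be registered programmatically rather than through the admin. A minimal sketch, assuming django-celery-beat; the task path below is hypothetical:

# Minimal sketch: register a recurring job via django-celery-beat.
# The task path is hypothetical; use whatever task your project defines.
from django_celery_beat.models import IntervalSchedule, PeriodicTask

schedule, _ = IntervalSchedule.objects.get_or_create(
    every=10, period=IntervalSchedule.MINUTES
)
PeriodicTask.objects.get_or_create(
    name="Update Bayesian Elo",
    interval=schedule,
    task="katago_server.trainings.tasks.update_bayesian_rating",  # hypothetical
)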

Connecting KataGo to the local server

Once you've put the necessary configs into the Run and set up the first network through the admin panel, you should be able to connect the distributed KataGo client to it (katago contribute), specifying http://localhost:3000 as the URL of the server, and have it work.

The distributed KataGo client can be built from the "distributed" branch of https://github.com/lightvector/KataGo/ or once it is released you can also download a prebuilt binary.
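Before pointing the client at the server, you can sanity-check that the server is reachable. A minimal sketch; the /api/runs/ endpoint path is an assumption about the API layout:

import requests

# Quick reachability check for a local test server.
# The endpoint path is an assumption; adjust to the actual API.
resp = requests.get("http://localhost:3000/api/runs/", timeout=10)
resp.raise_for_status()
print(resp.json())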

Getting a shell inside a container

If you want to get "inside" a container and actually have a running interactive shell there to be able to inspect things, run commands, etc, try this, depending on which container (django, nginx, etc) you want to get a shell inside:

docker-compose -f local.yml run --rm django bash

docker-compose -f local.yml run --rm postgres bash

docker-compose -f local.yml run --rm nginx bash

Accessing the raw database

If you want direct raw access to the database, to run raw postgres queries and inspect the tables that django has set up, you can do something like this:

docker exec -it NAME_OF_POSTGRES_CONTAINER psql -U DATABASE_USER_NAME -d katago_server_db

You can run "docker container list" to see the containers that you have running, and fill in the appropriate name or id in this command line, and also unless you changed it, the user name for a local server database is hardcoded to "debug", so you might have something like:

docker exec -it server_postgres_1 psql -U debug -d katago_server_db
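Alternatively, Django's built-in dbshell should connect using the server's own database settings, assuming the psql client is available inside the django container:

docker-compose -f local.yml run --rm django python manage.py dbshell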

Migrations

If you change any of the model definitions (Run, TrainingGame, Network, StartPos, etc) or add a new one, you will want to run:

docker-compose -f local.yml run --rm django python manage.py makemigrations

This tells django to make a file that performs the necessary database alterations. Depending on how you set up docker, this migrations file might end up owned by root because it was created from within docker, so you may need to sudo chown the file to be owned by you. Then add the file to git and commit it along with your changes.

On a local server, migrations should actually be applied the next time you start up the server. But to do this explicitly and manually if you want:

docker-compose -f local.yml run --rm django python manage.py migrate

Removing Docker Images and Volumes ("I messed up and want to start over")

Check for images and volumes:

docker image list

docker volume list

Then you can prune both lists for unused stuff:

docker image prune

docker volume prune

You can directly remove images you don't want:

docker image rm INSERT_NAME_OF_IMAGE

If you stop all containers and remove everything, this should put you back in the clean state you were in before you built the server or did anything.

Setting up a production server

To set up a production server, you'll need to also:

  • Copy ./envs/production_example to envs/production and edit each of those files where it indicates you should fill in a domain name, email, API key, or other parameter for your actual production site. These are the various environment variables that docker-compose exposes within all the individual containers for django, postgres, etc., which those containers' main processes (django process, postgres database process, etc.) use to configure themselves.
  • Copy ./env_example to ./.env and edit the few environment variables within similarly. It must be named ".env" - this is the name of the file that docker-compose attempts to read upon startup to grab extra environment variables out of, which are used in the docker-compose file itself.
  • In the process, you'll need to own an actual domain name with nameservers pointed appropriately to the box that this server will run on, sign up for mailgun and sentry and a few other recommended monitoring services for the website, and such.

On a production server, to pick up code changes you will need to rerun:

docker-compose -f production.yml build

Also, if there are migrations, you will need to run this to actually apply them to the production database:

docker-compose -f production.yml run --rm django python manage.py migrate

Be a little careful about whether the site should stay up or be taken down while the database is migrated. For a local server, these steps happen automatically simply when you "up" the server, but for prod, they must be done explicitly.

Helpful links

Email Server

lightvector: This bit is left over from the django cookiecutter readme. I'm not sure how relevant it is with all the docker containers in the way, but it sounds like a potentially useful thing when testing, modulo the fact that there's a docker container layer in between that you may have to work through.

In development, it is often nice to be able to see emails that are being sent from your application. For that reason, a local SMTP server, MailHog, with a web interface, is available as a docker container.

The mailhog container will start automatically when you run all the docker containers. Check the cookiecutter-django Docker documentation for more details on how to start all containers.

With MailHog running, to view messages that are sent by your application, open your browser and go to http://127.0.0.1:8025

katago-server's People

Contributors

dependabot[bot], iopq, kcwu, lightvector, nihisil, petgo3, sanderland, tychota


katago-server's Issues

verification mail sender issue

When registering, the katago server sends a verification mail.

The problem is that the sender is filled in as [email protected]. Note that the domain mail.katagotraining.org doesn't exist. My mail server treats mail from non-existent domains as spam and rejected it.

Please either set up the mail.katagotraining.org domain, or change the sender to something like [email protected].

Server connection being interrupted

2020-10-10 22:48:29+0800: Finished game 14782 (training), uploaded sgf katago_contribute/rect15/sgfs/rect15-b15c192-s169896448-d36146427/28BE3A23EC4F7FFD.sgf and training data katago_contribute/rect15/tdata/rect15-b15c192-s169896448-d36146427/A68FAAEB61F9C93F.npz (41 rows)
^C2020-10-10 23:09:10+0800: downloadModelIfNotPresent: Error connecting to server, possibly an internet blip, or possibly the server is down or temporarily misconfigured, waiting 5 seconds and trying again.
2020-10-10 23:09:10+0800: Error was:
No response from server
2020-10-10 23:09:10+0800: Beginning shutdown
2020-10-10 23:09:10+0800: Exited cleanly after signal
2020-10-10 23:09:10+0800: All cleaned up, quitting

Note that I had to ^C it first; it was stuck after losing the server connection for half an hour.
Only when I sent a signal to it did it actually give me the error - it didn't propagate the error right away, so maybe that's why it didn't retry.

Inconsistent password acceptance

I tried to contribute training games for the first time, but the console immediately said my username/password combination was wrong. It turned out a character in my password caused the issue. The special characters I had were '@' and '#'. I suspect the config reader saw '#' as the start of a comment and ignored the characters after it.

The problem is that the site let me create a password with that character. So I would recommend either disallowing the character when creating a password (though of course existing passwords will still have the problem) or fixing the bug that causes the error.
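For illustration, this is the kind of naive config parsing that would produce the behavior described above (a sketch of the suspected failure mode, not KataGo's actual parser):

# Illustrative only: a parser that treats '#' as a comment start
# silently truncates a password like "abc#def" to "abc".
def parse_config_line(line):
    line = line.split("#", 1)[0]  # strip comments
    if "=" not in line:
        return None
    key, value = line.split("=", 1)
    return key.strip(), value.strip()

print(parse_config_line("password = abc#def"))  # ('password', 'abc')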

unexpected result komi 6.5 results in draw and possibly suspicious sgf submitted to server

this game (https://katagotraining.org/sgfplayer/rating-games/1279994/) has result "draw", which should be impossible with a komi of 6.5, unless I'm mistaken

it's a match game, parsed in https://katagotraining.org/networks/kata1/kata1-b18c384nbt-s9791399168-d4261348054/rating-games/ ; also the moves are slightly suspicious (unusual, at least, I would say), even though this is subjective

other games uploaded by the same user seem to show weird or at least unusual joseki and play, though this is subjective too and may just be my impression: https://katagotraining.org/sgfplayer/rating-games/1279992/

Split "game" into "trainingGame" and "estimationGame"

Why

  • estimationGames are played without noise and with less selfplay
  • they can introduce bias

@tychota
I was thinking also of using them for training.
But same, i can change it back.
Can you explain me what bias it introduce ?

@lightvector
Extreme case: moves X and Y are about equally good but are distinctly recognizable as different styles of move. Network A is much weaker and prefers move X. Network B is much stronger and prefers move Y. This is due to random differences in preferences, not because Y is actually much better than X; the difference in strength is due to other, later moves. Nonetheless, the training may "learn" that if it sees move X played, the game will likely be a loss because it "must be network A that's playing this one", and if it sees move Y played, that the game will likely be a win because it "must be network B that's playing this one". So you will get false evaluations of moves X and Y.
There are other (unavoidable) sources of error inherent in training, but this source of error is at least avoidable if a network only plays itself. Then, even if you "recognize" what network it is, you can't inherently say that it's more or less likely to win, because as a baseline, its opponent is equal-strength still.

  • the risk is not worth it

@lightvector
But there's also the question of where we want to spend our "experimentation budget" - trying too many things at once is risky, would rather try this or experiment in some other way. Nobody else has used gating matches for selfplay training before, so it would be an experiment.

How

  1. Split the django model into:
     • one abstract class AbstractGame
     • two concrete classes TrainingGame and EloEstimationGame

  2. Write the API for both.

  3. Modify the distributed_efforts model / API to specify whether a task is an EloEstimationGameTask, a PredefinedTrainingGameTask (e.g. a game from a human sgf), or a DynamicTask (not stored in db??).

Implement Bayesian Elo to estimate network strength

Why

  • avoiding rock-paper-scissors style estimations that would falsely inflate the Elo

  • giving a true estimation

Rémi Coulom (https://www.remi-coulom.fr/Bayesian-Elo/)
The main flaw of this approach is that the estimation of uncertainty acts as if a player had played against one opponent whose Elo is equal to the mean Elo of the opponents. This assumption has bad consequences for the estimation of ratings and uncertainties:

  • The expected result against two players is not equal to the expected result against one single player whose rating is the average of the two players.
  • Estimation of uncertainty is wrong, because 10 wins and 10 losses against a 1500-Elo opponent should result in less uncertainty than 10 wins against a 500-Elo opponent and 10 losses against a 2500-Elo opponent.

Also, another problem is that Elostat's estimation of uncertainty treats the ratings of opponents as if they were their true ratings. But those ratings also have some uncertainty that should be taken into consideration.
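To see the first point numerically, compare the average result against two opponents with the result against their average. A quick sketch using the standard Elo expected-score formula (illustrative, not code from this repo):

# Expected score under the standard Elo logistic model.
def expected_score(rating, opponent_rating):
    return 1.0 / (1.0 + 10 ** ((opponent_rating - rating) / 400.0))

player = 1600
# Average result against opponents rated 1000 and 2000 (mean 1500):
avg_of_results = (expected_score(player, 1000) + expected_score(player, 2000)) / 2
# Result against a single opponent rated 1500:
result_vs_avg = expected_score(player, 1500)
print(round(avg_of_results, 3))  # ~0.530
print(round(result_vs_avg, 3))   # ~0.640 -- not the same thing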

Constraint

  • fair games only

@tychota
Does it require fair matches?
Or can it also work with extra komi / handicap?
My understanding from reading Rémi Coulom's paper is that it can work,
as long as we can calculate the Elo advantage probability, but I'm not sure.

@lightvector
A fair match is ideal, because with komi and handicap you don't know how much those "affect" the model. Either you have to explicitly treat "bot that also gets 2 stones to start" as a separate bot from "bot that plays even" and track its rating separately, or you have to have a model (which may not be correct) of how much a handicap stone affects things.

You can make the amount that a handicap stone affects things part of the model to be tuned itself, but one issue there is that it's not linear. Rémi Coulom's old papers do describe how to easily do a Bradley-Terry model where "teams" compete and a team is linear in the strength of its players in Elo space. So you could have a team of "bot" and "handicap stone", so that "handicap stone" gets its own rating delta. But the problem is that it's not linear - the stronger you get, the more a handicap stone is worth.
(Bradley Terry model is simply "one logarithm away" from Elo model, so it's basically Elo model too)
So I would say basically that fair matches are required.
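For reference, a minimal sketch of the Bradley-Terry "team" idea mentioned above (illustrative only, not this repo's rating code). Team strength is the sum of the members' log-gammas, i.e. linear in Elo space:

import math

def team_win_prob(team1_log_gammas, team2_log_gammas):
    # Bradley-Terry: P(team1 beats team2) = gamma1 / (gamma1 + gamma2),
    # where gamma = exp(sum of the team members' log-gammas).
    s1 = sum(team1_log_gammas)
    s2 = sum(team2_log_gammas)
    return 1.0 / (1.0 + math.exp(s2 - s1))

# e.g. a "bot + handicap stone" team versus a stronger bot,
# where the stone gets its own (fixed) rating delta:
print(team_win_prob([0.0, 0.8], [1.0]))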

How

  • in python (I would prefer sticking to python for the server), create a celery task that:
    • computes EloEstimationGame results for each network,

TBC

Links

Design and estimate the cost of cloud storage

Why

  • doing cloud storage is safer, as the provider replicates the files and ensures a level of quality of service
  • GCP or AWS object stores are quite OK at storing millions of small files

Design

I expect to have two buckets on GCP.

  1. a public bucket:

    • SGF files are uploaded to it after MIME type verification (and other verification?)
    • npz training data are uploaded to it

  2. a training bucket:

    • every 400 TrainingGames, a task downloads the 400 (??) npz files, concatenates them together, and uploads the result to the training bucket (see the sketch after this list)
    • the unpacked npz files are then removed from the public bucket
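A minimal sketch of the concatenation step, assuming the training rows in each .npz concatenate along the first axis (illustrative, not this repo's actual shuffle/concat code):

import numpy as np

def concatenate_npz(paths, out_path):
    # Load each .npz and concatenate matching arrays along axis 0.
    parts = [np.load(p) for p in paths]
    keys = parts[0].files
    merged = {k: np.concatenate([p[k] for p in parts], axis=0) for k in keys}
    np.savez_compressed(out_path, **merged)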

The public bucket is readable publicly, while uploads must go through the server (? investigate signed POSTs so that uploads do not go through the server).
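On the signed-upload idea: GCS supports signed URLs that let a client upload directly to a bucket without the bytes passing through the server. A minimal sketch with the google-cloud-storage library; the bucket and object names are hypothetical:

from datetime import timedelta
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("katago-public-bucket")  # hypothetical bucket name
blob = bucket.blob("sgfs/some_game.sgf")        # hypothetical object name

# Short-lived URL the client can PUT the file to directly.
url = blob.generate_signed_url(
    version="v4", expiration=timedelta(minutes=15), method="PUT"
)
print(url)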

The training bucket is not readable publicly.
Every training machine periodically syncs files from the training bucket to get the concatenated npz files, shuffles them locally, and trains on the shuffled data.

Then the network is uploaded to the public bucket.

Links

Let uncertainty factor be configurable instead of hardcoded 2

(issue after discord discussion)

[gjm] I wonder whether it would be an improvement to make use of the prior information we already have: most networks are similar in strength to previous ones, or some simple extrapolation from previous ones. So why not plot a graph that takes a prior based on other recent networks, updates on evidence from games played, and shows something like the 2.5, 50, 97.5 percentiles of the posterior?

[lightvector] Right now it does show the posterior already, it's just that the prior is very weak. The nice thing about having only a very weak prior that current network = previous network is that the entire thing is essentially unbiased. I know you tried to cover that by mentioning "simple extrapolation", but then you start having additional parameters relating to how to do that, and it gets messy, particularly since the optimization algorithm currently doesn't handle priors like "assume things continue in straight line log-scale trends" or whatever.
It seems a lot simpler to just move the "strongest confident" algorithm to be 3 sigma instead of 2 and let things continue to be nearly-unbiased.
Which is still a (simple) open task.

First, why do we need to compute that value?

We need the lower confidence (log_gamma_lower_confidence) to select the best network (so katrain can download it, for example).

We need the upper confidence (log_gamma_upper_confidence) to select a network worth spending rating games on. That way, we eliminate falsely good networks (high Elo and high uncertainty).

We need to store log_gamma_lower_confidence = log_gamma - <some_factor> * log_gamma_uncertainty and log_gamma_upper_confidence = log_gamma + <some_factor> * log_gamma_uncertainty in the db because we are going to sort by log_gamma_lower_confidence or log_gamma_upper_confidence, and without database indexes the queries are going to be slow. And without relying on SQL dark magic, the result needs to be precomputed in order to be indexed.
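A minimal sketch of what the precomputed, indexed fields could look like on the model (illustrative field definitions under standard Django; the actual model surely differs):

from django.db import models

class Network(models.Model):
    log_gamma = models.FloatField(default=0.0)
    log_gamma_uncertainty = models.FloatField(default=0.0)
    # Precomputed confidence bounds, indexed so sorting on them stays fast.
    log_gamma_lower_confidence = models.FloatField(default=0.0, db_index=True)
    log_gamma_upper_confidence = models.FloatField(default=0.0, db_index=True)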

Then, when do we need to compute that value?

We first set the fields on upload, with this code:

data = validated_data.copy()
if "parent_network" in data:
    if data["parent_network"]:
        data["log_gamma"] = data["parent_network"].log_gamma
        data["log_gamma_uncertainty"] = 2.0
        data["log_gamma_lower_confidence"] = data["log_gamma"] - 2 * data["log_gamma_uncertainty"]
        data["log_gamma_upper_confidence"] = data["log_gamma"] + 2 * data["log_gamma_uncertainty"]
if "log_gamma" in data and "log_gamma_uncertainty" not in data:
    data["log_gamma_uncertainty"] = 2.0
if "log_gamma" in data and "log_gamma_uncertainty" in data:
    if "log_gamma_lower_confidence" not in data:
        data["log_gamma_lower_confidence"] = data["log_gamma"] - 2 * data["log_gamma_uncertainty"]
    if "log_gamma_upper_confidence" not in data:
        data["log_gamma_upper_confidence"] = data["log_gamma"] + 2 * data["log_gamma_uncertainty"]

We also update networks every 10 min, here:

current_run = Run.objects.select_current()
if current_run is None:
    return
network_ratings = Network.pandas.get_ratings_dataframe(current_run)
anchor_network = Network.objects.filter(run=current_run).order_by("pk").first()
if anchor_network is None:
    return
detailed_tournament_result = RatingGame.pandas.get_detailed_tournament_results_dataframe(
    current_run, for_tests=for_tests
)
assert_no_match_with_same_network = (
    detailed_tournament_result["reference_network"] != detailed_tournament_result["opponent_network"]
)
detailed_tournament_result = detailed_tournament_result[assert_no_match_with_same_network]
bayesian_rating_service = BayesianRatingService(
    network_ratings, anchor_network.id, detailed_tournament_result, current_run.virtual_draw_strength
)
new_network_ratings = bayesian_rating_service.update_ratings_iteratively(current_run.elo_number_of_iterations)
Network.pandas.bulk_update_ratings_from_dataframe(new_network_ratings)

which uses:

dataframe["log_gamma_upper_confidence"] = dataframe["log_gamma"] + 2 * dataframe["log_gamma_uncertainty"]
dataframe["log_gamma_lower_confidence"] = dataframe["log_gamma"] - 2 * dataframe["log_gamma_uncertainty"]

Which is good, as we don't need to care about old networks; they will be updated anyway 🙂

How to change it?

As you can see there is some coupling (two identical magic constants), so maybe the 2 can be extracted into the run model as an integer field (or maybe a float field), defined much like the existing

elo_number_of_iterations = IntegerField(
    _("Elo computation number of iterations"),
    help_text=_("How many iterations to use per celery task to compute log_gammas and Elos."),
    default=10,
    validators=[validate_positive],
)

and added to the admin: this way it can be changed at runtime.
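A minimal sketch of what the proposed run-level field might look like (hypothetical name and wording):

uncertainty_factor = FloatField(
    _("Rating uncertainty factor"),
    help_text=_("Multiplier on log_gamma_uncertainty when computing confidence bounds."),
    default=2.0,
    validators=[validate_positive],
)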

Will it break anything?

  • people relying on log_gamma_upper_confidence from the API will now see a really different value
    => we can just notify potential users here for now
    => ideally we shouldn't expose log_gamma_lower_confidence or log_gamma_upper_confidence at all, but let people calculate them from the exposed log_gamma and log_gamma_uncertainty

  • UI (network page, Elo graph) needs updating if we want it to be coherent with the internal changes (@lightvector#3657 do we?):
    => the django-generated view uses:

    def rating(self):
        return f"{self.elo:.{self.elo_precision}f} ± {2 * self.elo_uncertainty:.{self.elo_precision}f}"

    => the elo graph uses:
    .attr("y1", network => yScale(network["elo"] - 2.0 * network["elostdev"]))
    and hardcodes X * network["elostdev"] elsewhere, e.g. for the graph axis min and max Elo:
    var minElo = d3.min(filtered.map(network => network["elo"] - Math.min(300, 2.5 * network["elostdev"])));
    var maxElo = d3.max(filtered.map(network => network["elo"] + Math.min(300, 2.5 * network["elostdev"])));

  • unit tests may break (if they are good, they will)
