allegroai / clearml-server

ClearML - Auto-Magical CI/CD to streamline your AI workload. Experiment Management, Data Management, Pipeline, Orchestration, Scheduling & Serving in one MLOps/LLMOps solution

Home Page: https://clear.ml/docs

License: Other

Python 99.49% Jinja 0.06% Dockerfile 0.11% Shell 0.34%

version-control experiment-manager version control experiment deeplearning deep-learning machine-learning machinelearning ai


clearml-server's Issues

Internal Server Error

Following the Docker installation instructions on Ubuntu 18.04 I end up with a page stating "Internal Server Error" when I try to access the webserver.

The last entry in the webserver logfile is:

[2019-06-25 09:51:57,329] [7] [INFO] [trains.webserver] ################ Web Server initializing #####################

Description flushes when cloning experiment

Bug: when I try to type a description in a cloned experiment, it constantly disappears after about 5 seconds of typing -- no matter the length of the string typed.

Here's a recording of this issue at 18 fps: video.zip

Server password doesn't work suddenly and needs a restart

I'm running trains-server locally (for testing purposes on OSX). After a bit of fiddling, I think the docker containers work as expected; at least I don't see any errors (when running without -d). I'm following the trains-server suggestion of setting ~/trains.conf simply to

api {
    host: "http://localhost:8008"
}

I tried running trains-init, but when I go to localhost:8080/admin, generating the credentials does not work (/admin is redirected to /profile). "Does not work" here means that a spinner appears and disappears after a few seconds, but nothing else happens.

So when I test the matplotlib example, nothing seems to happen. Instead I get many lines of

Retrying (Retry(total=239, connect=239, read=240, redirect=240, status=240)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x11c0030b8>: Failed to establish a new connection: [Errno 61] Connection refused',)': /auth.login

It seems that something is not working with respect to the api server? This is how I'm running it and its output:

sudo docker run --restart="always" --name="trains-apiserver" --network="host" -v /opt/trains/logs:/var/log/trains -v /opt/trains/config:/opt/trains/config allegroai/trains:latest apiserver
Password:
[2019-07-18 23:21:22,100] [9] [WARNING] [trains.schema] failed loading cache: [Errno 2] No such file or directory: '/opt/trains/server/schema/services/_cache.json'
[2019-07-18 23:21:22,101] [9] [INFO] [trains.schema] regenerating schema cache
[2019-07-18 23:21:23,528] [9] [INFO] [trains.server] ################ API Server initializing #####################
[2019-07-18 23:21:23,530] [9] [INFO] [trains.database] Initializing database connections
[2019-07-18 23:21:23,531] [9] [INFO] [trains.database] Registering connection to auth-db (mongodb://127.0.0.1:27017/auth)
[2019-07-18 23:21:23,532] [9] [INFO] [trains.database] Registering connection to backend-db (mongodb://127.0.0.1:27017/backend)
[2019-07-18 23:21:23,534] [9] [INFO] [trains.init_data] Applying mappings to host: http://127.0.0.1:9200
[2019-07-18 23:21:23,795] [9] [INFO] [trains.init_data] [{'mapping': 'events_log', 'result': '{"acknowledged":true}'}, {'mapping': 'events_training_debug_image', 'result': '{"acknowledged":true}'}, {'mapping': 'events_plot', 'result': '{"acknowledged":true}'}, {'mapping': 'events', 'result': '{"acknowledged":true}'}]
[2019-07-18 23:21:24,073] [9] [INFO] [trains.server] Exposed Services: auth.create_credentials auth.create_user auth.edit_user auth.fixed_users_mode auth.get_credentials auth.get_token_for_user auth.login auth.logout auth.revoke_credentials auth.validate_token events.add events.add_batch events.debug_images events.delete_for_task events.download_task_log events.get_multi_task_plots events.get_scalar_metric_data events.get_scalar_metrics_and_variants events.get_task_events events.get_task_latest_scalar_values events.get_task_log events.get_task_plots events.get_vector_metrics_and_variants events.multi_task_scalar_metrics_iter_histogram events.scalar_metrics_iter_histogram events.vector_metrics_iter_histogram models.create models.delete models.edit models.get_all models.get_all_ex models.get_by_id models.get_by_task_id models.set_ready models.update models.update_for_task projects.create projects.delete projects.get_all projects.get_all_ex projects.get_by_id projects.get_unique_metric_variants projects.update tasks.close tasks.completed tasks.create tasks.delete tasks.edit tasks.failed tasks.get_all tasks.get_all_ex tasks.get_by_id tasks.ping tasks.publish tasks.reset tasks.set_requirements tasks.started tasks.stop tasks.stopped tasks.update tasks.update_batch tasks.validate users.create users.delete users.get_all users.get_all_ex users.get_by_id users.get_current_user users.get_preferences users.set_preferences users.update
Loading config from /opt/trains/server/config/default
Loading config from file /opt/trains/server/config/default/hosts.conf
Loading config from file /opt/trains/server/config/default/logging.conf
Loading config from file /opt/trains/server/config/default/apiserver.conf
Loading config from file /opt/trains/server/config/default/secure.conf
Loading config from file /opt/trains/server/config/default/services/events.conf
Loading config from file /opt/trains/server/config/default/services/tasks.conf
Loading config from /opt/trains/config
 * Serving Flask app "server" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off

I guess I must be doing something wrong. Any help is greatly appreciated.

Login on fixed users mode doesn't work

After several attempts to define fixed users in apiserver.conf (and trying your example), signing in is unsuccessful.

After deleting and installing the new docker image and containers, I followed the instructions in your latest release to add a list of fixed users (and used no special characters).
I restarted the trains-apiserver container. And yet, when I got to the login screen (the new one, and not the "Enter Full Name" screen), signing in with the correct details gives an "Invalid User/Password combination" error.
The response from the auth.login service is the following:

{  
   "meta":{  
      "id":"802d6a6810d24c1a885292ef35c358dc",
      "trx":"802d6a6810d24c1a885292ef35c358dc",
      "endpoint":{  
         "name":"auth.login",
         "requested_version":"2.2",
         "actual_version":"1.0"
      },
      "result_code":401,
      "result_subcode":22,
      "result_msg":"Unauthorized (invalid credentials) (failed to locate provided credentials)",
      "error_stack":null
   },
   "data":{  

   }
}

I get the same error when using your example with username: jane, password: 12345678.
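
For anyone debugging a response like the one above, here is a minimal sketch of unpacking the response envelope on the client side. It relies only on the `meta` and `data` fields visible in the JSON above; the `unwrap` helper, the `ApiError` class, and the assumption that a successful call reports `result_code` 200 are illustrative and not part of trains.

```python
import json


class ApiError(Exception):
    """Raised when the response envelope reports a non-200 result code."""


def unwrap(response_text):
    """Return the 'data' payload, or raise ApiError built from the 'meta' fields.

    Assumption: a successful call carries meta.result_code == 200.
    """
    body = json.loads(response_text)
    meta = body.get("meta", {})
    code = meta.get("result_code")
    if code != 200:
        endpoint = meta.get("endpoint", {}).get("name")
        raise ApiError(
            f"{endpoint}: {code}/{meta.get('result_subcode')} {meta.get('result_msg')}"
        )
    return body.get("data", {})
```

With the response shown above, `unwrap` would raise an `ApiError` whose message contains the code, the subcode, and the "failed to locate provided credentials" text, which is easier to spot in logs than the raw JSON.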

Accessing the data via API

Hi, thanks for the great opensource release!

I have one question regarding accessing the data from python.
Please excuse my utter ignorance in this matter.

I'd like to fetch data from trains-server for visualization / exploratory data analysis purposes.
Could you point me to some examples on how to do this in the most sane manner?

By digging through the code I found this:
https://github.com/allegroai/trains-server/tree/master/server/services
and looking at the @endpoint decorators I cooked up something like:

import requests
from pyhocon import ConfigFactory

# Read the API host and credentials from the local trains.conf
cfg = ConfigFactory.parse_file("/home/elan/trains.conf")
base = cfg['api']['host']
user = cfg['api']['credentials']['access_key']
secret = cfg['api']['credentials']['secret_key']


def _get(endpoint, **kwargs):
    # Call an API endpoint with basic auth and return the 'data' part of the response
    return requests.get(
        f'{base}/{endpoint}',
        auth=(user, secret),
        params=dict(kwargs),
    ).json()['data']
    

def get_projects():
    return _get(
        'projects.get_all'
    )['projects']
    

def get_tasks():
    return _get(
        'tasks.get_all'
    )['tasks']


def get_scalar_metrics_and_variants(task_id):
    return _get(
        'events.get_scalar_metrics_and_variants', 
        task=task_id
    )


def get_scalar_metric_data(task_id, metric):
    return _get(
        'events.get_scalar_metric_data',
        task=task_id,
        metric=metric
    )

which I can then use as:

P = get_projects()
T = get_tasks()

first_task = T[0]['id']

avail_metrics = get_scalar_metrics_and_variants(first_task)

mse_loss = get_scalar_metric_data(first_task, 'MSE loss')

Is this the right approach? Or does trains provide some kind of abstractions to make this easier?

As mentioned, please excuse my utter ignorance in this subject.

best,
Marcin

Help - Is there a possibility to configure LDAP based Authentication?

Add option to smooth scalar graphs (similar to TensorBoard, but better)

In TensorBoard one can control the amount of smoothing that is applied to all scalar graphs.
This is a desirable feature, since graphs can be very jittery.
Request: in the Metrics view (Experiments->Results->Metrics) add a settings option for "Smoothing" with a slider. Higher values will induce smoother graphs, while smaller values will result in minor smoothing, or not at all (for the value of zero).
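
To make the request concrete, here is a minimal sketch of one common smoothing scheme: an exponential moving average with bias correction, driven by a single weight such as a slider would provide. The `smooth` function and its `weight` parameter are hypothetical names for illustration; this is not taken from the TensorBoard or trains code.

```python
def smooth(values, weight):
    """Exponential moving average with bias correction.

    weight is in [0, 1): 0 returns the input unchanged, values close to 1
    produce a much smoother curve.
    """
    if not 0.0 <= weight < 1.0:
        raise ValueError("weight must be in [0, 1)")
    smoothed = []
    last = 0.0
    for i, value in enumerate(values, start=1):
        last = last * weight + (1.0 - weight) * value
        # Divide by (1 - weight**i) so early points are not biased towards zero
        smoothed.append(last / (1.0 - weight ** i))
    return smoothed
```

A weight of zero leaves the series untouched, which matches the "not at all (for the value of zero)" behaviour described in the request.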

Manage disk space usage

Hi,

It has been 3 weeks since I deployed trains-server for my research team, and I am monitoring the disk space used by the server data.

Today it takes more than 2GB (around 30 tasks) which seems to be huge for just graphs' coordinates and hyperparameters.
We use the default Pytorch-Tensorboard logger with resources monitoring disabled and all models are empty, they just display the path of the stored checkpoint.

90% of the space is used by Elasticsearch indices, and since there is no way to delete a training task, do you know how to "clean" the indices or how to reduce the amount of space used for each task?

Thank you

Logging experiments to S3 bucket

I would like to have all the experiments stored in an S3 bucket instead of keeping them locally on the machine. If I'm not wrong, by default trains-server logs everything to /opt/trains/data/fileserver. Is it possible to connect trains-server directly to an S3 bucket without using an intermediate tool like s3fs or similar?

I'm able to upload the models and artifacts to an S3 bucket by using output_uri from trains, but I can't figure out how to log the rest (graphs, metrics, etc.).

How to update trains-server when using AWS AMI ?

Other than monitoring the documentation / readme on the project page, are there any suggestions for how to easily update the server (/services)?

fileserver can't handle long file names

Hi,

I got a 500 from the server while training. After looking in the logs, I found out it was because of the file name.

Logs from /opt/trains/logs/fileserver.log

[2020-07-12 14:22:04,977] [7] [ERROR] [fileserver] Exception on / [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.6/site-packages/flask_cors/extension.py", line 161, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "fileserver.py", line 32, in upload
    file.save(str(target))
  File "/usr/local/lib/python3.6/site-packages/werkzeug/datastructures.py", line 3066, in save
    dst = open(dst, "wb")
OSError: [Errno 36] File name too long: '/mnt/fileserver/trains/14-trains.bd72a5a2afdhsy2aa0acc3dca21b9b5f/metrics/Evaluator CV_no_my_real_name_no_my_real_name_no_my_real_name_len_512_Jul12_14-20-42_merge__no_my_real_nameh_sz_8__no_my_real_name_sz_8_lr_1e-06_w_decay_0.0_warm_up_50_in_sz_768_hid_sz_256_word_aug_p_0.0_no_my_real_name_1__no_my_real_name/_no_my_real_name _no_my_real_name_no_my_real_namele layer__no_my_real_namelen_512_Jul12_14-20-42__no_my_real_name_batch_sz_8_test_batch_sz_8_lr_1e-06_w_d_no_my_real_name0_in_sz_768_hid__no_my_real_name_0.0_word_aug_min_1_imba_no_my_real_name__no_my_real_name_00000000.jpeg'
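
One possible way to avoid this error is to shorten each path component before saving. The sketch below is illustrative only: `shorten_component`, `shorten_path`, and the 255-byte limit (a common per-component limit on Linux filesystems, assumed here) are not part of the fileserver code. A short hash of the original name is appended so that two long names that share a prefix stay distinct.

```python
import hashlib

# Assumption: 255 bytes per path component, a common limit on Linux filesystems
MAX_COMPONENT_BYTES = 255


def shorten_component(name, limit=MAX_COMPONENT_BYTES):
    """Return name unchanged if it fits, else a truncated name with a hash suffix."""
    raw = name.encode("utf-8")
    if len(raw) <= limit:
        return name
    stem, dot, ext = name.rpartition(".")
    if not dot or len(ext) > 10:
        # No usable extension: treat the whole name as the stem
        stem, ext = name, ""
    suffix = "_" + hashlib.sha1(raw).hexdigest()[:8] + ("." + ext if ext else "")
    keep = limit - len(suffix.encode("utf-8"))
    # Cut on bytes, then drop any partial multi-byte character at the end
    head = stem.encode("utf-8")[:keep].decode("utf-8", errors="ignore")
    return head + suffix


def shorten_path(path):
    """Apply shorten_component to every '/'-separated part of a path."""
    return "/".join(shorten_component(part) for part in path.split("/"))
```

The trade-off is that the stored name no longer matches the name the client sent, so whoever reads the file back has to apply the same function.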

Deploying trains on Kubernetes

I was wondering if you have a helm chart or pre-built docker images that could be used to deploy trains-server to a Kubernetes cluster hosted on Azure?

How to config SSL?

I use Nginx as the web server. It lets me access trains (port 8080) through a subdomain (e.g., trains.xxx.com). I also use Let's Encrypt to set up SSL for my web server.

When I open https://trains.xxx.com, I get some errors in my Chrome console:

There is no password box on the web page, so I cannot log in to my trains web page.

I can use ip:port to access my trains web page with some errors, and there the password box exists:

When I log in:

Is there any documentation on how to configure SSL?

Thanks for your help.

CORS Error when setting up trains server using docker

Azure VM: Ubuntu 18.04
Docker Version: latest

During the initial login I get an error. I have followed the docker Ubuntu installation steps in the guide.
The error in the browser (developer tools console) is:
Access to XMLHttpRequest at 'http://xxx.xxx.xx.xxx:8008/v2.5/users.get_preferences' from origin 'http://xxx.xxx.xx.xxx:8080' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource.

unable to generate new projects and admin credentials when running a local server

Hello,

I've installed the server on my local machine. I'm able to access the trains page using the following address:
http://localhost:8080/dashboard
when the page opens I get the following messages on the right hand side of the screen:
X Fetching projects failed X
X Fetching recent experiments failed X
When trying to create a new project I get the following message:
X Project Created Failed X
When trying to authenticate with the server at the following address:
http://localhost:8080/profile
by pressing Create New Credentials, the system does not respond. After pressing the + button beneath the bucket,
I get the following message:
~/trains.conf

api {

host: http://localhost:8008

credentials {

   "access_key" = "undefined"

   "secret_key" = "undefined"

}

Adding rest api docs

How to remove experiments or projects

I'm taking a look at trains-server and saw that lots of example experiments are already recorded by default. How can I remove them, or empty the project, to register my own? Thank you in advance!

And if possible, can I remove or archive multiple selected experiments at once? If this is not already implemented, I'd like to ask for it as a feature request. Thank you!

GCloud installation instructions not working

Instructions from install_gcp.md do not work: the following error is thrown:

Creating image "allegro-trains-server" failed. Error: Invalid value for field 'resource.rawDisk.source': 'https://storage.googleapis.com/allegro-files/trains-server/trains-server.vmdk'. The provided source is not a supported file.

As written in the hint for Google Cloud Image,

Your image source must use the .tar.gz extension and the file inside the archive must be named disk.raw

I guess the source should probably be "Virtual Disk (VMDK, VHD)" instead of "Google storage file". But then it requires Cloud Build to be enabled (and more permissions from the user).

Is the latest docker-compose.yml up-to-date?

Hey,

I'm trying to update the server to 0.13.2 without any luck. I'm working with the docker image and following your upgrade instructions, but I keep installing 0.13. Any tips?

Thanks in advance,
Majd

Can this be deployed on Mac more smoothly?

I followed the instructions for Ubuntu but still can't get it to work. Sadly, I don't know what's wrong with it.

Using subdomains deployment still goes to subdomain:8008 in some cases

Hello,

I have installed trains-server with Helm on my cluster.

I followed all the instructions in the README.md file and everything seems to work fine.

I have added an ingress to my cluster in order to have my load balancer communicating with trains' services:

# Source: trains/templates/ingress.yaml
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: release-name-trains
  labels:
    app.kubernetes.io/name: trains
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/managed-by: Tiller
  annotations:
    certmanager.k8s.io/cluster-issuer: letsencrypt-prod
    kubernetes.io/ingress.class: nginx

spec:
  tls:
    - hosts:
        - "trainsapp.stage.mydomain.ca"
        - "trainsfiles.stage.mydomain.ca"
        - "trainsapi.stage.mydomain.ca"
      secretName: tls-secret-prod.trainsapp.mydomain.ca
  rules:
    - host: "trainsapp.stage.mydomain.ca"
      http:
        paths:
          - path: /
            backend:
              serviceName: webserver-service
              servicePort: 80
    - host: "trainsapi.stage.mydomain.ca"
      http:
        paths:
          - path: /
            backend:
              serviceName: apiserver-service
              servicePort: 8008
    - host: "trainsfiles.stage.mydomain.ca"
      http:
        paths:
          - path: /
            backend:
              serviceName: fileserver-service
              servicePort: 8081

When I go to trainsapp.stage.mydomain.ca the dashboard opens and it does seem to work. However, some of the requests try to open trainsapp.stage.mydomain.ca:8008 for some reason (instead of just trainsapp.stage.mydomain.ca):

What am I doing wrong?

Thank you,
Shaked

GCP support?

I want to deploy on GCP.
Do you plan to support?

Why run docker-compose with sudo?

I saw the use of sudo in the linux docs for running the docker and docker-compose up commands. What's the reason for this? It doesn't seem like a good idea. It seems to work fine in my testing without running with sudo. Did you put your user into the docker group after installing docker? That's the way the docker instructions say to run things.

docker-compose config volume bind missing

The bind to /opt/trains/config is missing from the compose yaml files so the apiserver conf isn't applied

train-server get stuck and stops responding

After trains-server has been running for a few days, the web server gets stuck.
I can't access the web GUI.
The training jobs seem to get stuck as well (which is the most significant issue for me). I get the errors:

Retrying (Retry(total=82, connect=240, read=82, redirect=240, status=240)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='...', port=8008): Read timed out. (read timeout=300.0)")': /v1.5/events.add_batch
Retrying (Retry(total=101, connect=240, read=101, redirect=240, status=240)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='...', port=8008): Read timed out. (read timeout=300.0)")': /v1.5/models.create
Retrying (Retry(total=81, connect=240, read=81, redirect=240, status=240)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='...', port=8008): Read timed out. (read timeout=300.0)")': /v1.5/events.add_batch
Retrying (Retry(total=81, connect=240, read=81, redirect=240, status=240)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='...', port=8008): Read timed out. (read timeout=300.0)")': /v1.9/tasks.get_by_id

When trying to shut down trains-server using docker compose down, I get:

Stopping trains-webserver  ... done
Stopping trains-apiserver  ...
Stopping trains-mongo      ... error
Stopping trains-elastic    ... error
Stopping trains-fileserver ... error

ERROR: for trains-apiserver  UnixHTTPConnectionPool(host='localhost', port=None): Read timed out. (read timeout=70)
ERROR: An HTTP request took too long to complete. Retry with --verbose to obtain debug information.
If you encounter this issue regularly because of slow network conditions, consider setting COMPOSE_HTTP_TIMEOUT to a higher value (current value: 60).

Pressing ctrl+c in the trains-server session and then running docker compose down works.

After the restart, trains-server works and all experiments continue.

Recover data after docker-compose down

Hello!

In order to get rid of the bug below, I used docker-compose -f .\docker-compose-win10.yml down and then docker-compose -f .\docker-compose-win10.yml up -d.

Failed logging task to backend (2 lines, <500/100: events.add_batch/v1.0 (General data error: err=('2 document(s) failed to index.', [{'index': {'_index': 'events-log-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': '45d72e8d79724292a7f7b0a5f58fb681', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[events-log-d1bd92a3b039400cbafc60a7a, {'index': {'_index': 'events-log-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': 'event', '_id': 'e61e98adaff94753afb633ef67afc017', 'status': 503, 'errouests and a refresh]'}, 'data': {'timr': {'type': 'unavailable_shards_exception', 'reason': '[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not active Timeout: [1m], reqted new task id=63f6480b0a9a4d078a80cuest: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [2] requests and a refresh]'}, 'data': {'timestamp': 158576280385og', '@timestamp': '2020-04-01T17:40:0, 'type': 'log', 'task': '63f6480b0a9a4d078a80c8748f27fc65', 'level': 'info', 'worker': 'pa-barbosa01', 'msg': 'Train for 10 steps, validate for 263 s3afb633ef67afc017', 'status': 503, 'eteps\nEpoch 1/5', '@timestamp': '2020-04-01T17:40:04.236Z', 'metric': '', 'variant': ''}}}]), extra_info=[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][request: [BulkShardRequest [[events-l0] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[events-log-d1bd92a3b039400cbafc60a7a5b1e52b][0]] containing [2] requests andb0a9a4d078a80c8748f27fc65', 'level':  a refresh])>)

After that, I no longer see any of the experiments I have run so far. I thought that the services' containers had all the data mapped into the host filesystem (c:\opt\trains in my case).

How can I recover my experiments' data?

The meaning of status 'published'

This tool is really great!!
I have a question about the status 'published' in an experiment. What does the server do when I publish an experiment? Is this just a tag or label for an experiment, or is anything else going on behind the scenes?

Is there a way to retroactively upload an experiment, e.g. due to communication problems?

Consider the following scenario:
I have a training machine running an experiment, and network communication for this machine is off for some reason.
In this case I am not able to update trains-server with the running experiment.
It would be nice if I could retroactively send the missing experiment results (in TF these are mainly the events files).
Alternatively, I could upload a new experiment completely from scratch.
This could be done either from the trains API in the client, or through some UI option in trains-server.

Feature request: Display errors encountered by the experiment in the UI

Feature request

When running an experiment code, the UI displayed an error from the server (500), but with no details regarding the cause. After exploring the logs I found out that the fileserver crashed because the file name was too long.

I would love a way for the UI to clearly display all kinds of errors encountered by the experiment (including, but not limited to, file names being too long...)

The solution I would like

I would rather get a message in the UI saying that the file name is too big, rather than have to look for the issue in the logs

Additional context

Logs from /opt/trains/logs/fileserver.log

[2020-07-12 14:22:04,977] [7] [ERROR] [fileserver] Exception on / [POST]
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.6/site-packages/flask_cors/extension.py", line 161, in wrapped_function
    return cors_after_request(app.make_response(f(*args, **kwargs)))
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.6/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.6/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "fileserver.py", line 32, in upload
    file.save(str(target))
  File "/usr/local/lib/python3.6/site-packages/werkzeug/datastructures.py", line 3066, in save
    dst = open(dst, "wb")
OSError: [Errno 36] File name too long: '/mnt/fileserver/trains/14-trains.bd72a5a2afdhsy2aa0acc3dca21b9b5f/metrics/Evaluator CV_no_my_real_name_no_my_real_name_no_my_real_name_len_512_Jul12_14-20-42_merge__no_my_real_nameh_sz_8__no_my_real_name_sz_8_lr_1e-06_w_decay_0.0_warm_up_50_in_sz_768_hid_sz_256_word_aug_p_0.0_no_my_real_name_1__no_my_real_name/_no_my_real_name _no_my_real_name_no_my_real_namele layer__no_my_real_namelen_512_Jul12_14-20-42__no_my_real_name_batch_sz_8_test_batch_sz_8_lr_1e-06_w_d_no_my_real_name0_in_sz_768_hid__no_my_real_name_0.0_word_aug_min_1_imba_no_my_real_name__no_my_real_name_00000000.jpeg'

Security vulnerability with default setup

The current setup guide for Linux (here) is unsafe.
It's not your role to take care of the server's security, but what do you think about adding a comment at the end about it?

After just a week, one of our servers got infected by the kinsing malware, a cryptocurrency miner.
The issue has been documented here

It could easily be avoided by, for example, setting up the firewall on the server to prevent access to the redis instance:

sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow ssh
sudo ufw allow 8080
sudo ufw allow 8081
sudo ufw allow 8008
sudo ufw enable

Using a path instead of a port for api server

Hi, I can only expose ports 80 and 443 (https). Port 8008 is not permitted because of the way I have my kubernetes cluster set up. I can add rules by path or by another subdomain for serving the api, but I see that there is no argument or environment variable to set the api url when running the web server (static files served with nginx). I would like to request an environment variable or argument to update the api url in the webserver. Or is there a workaround for setting up an ingress in kubernetes for trains?

Thanks for the great app!

Add support for Windows 10

AFAIK, Currently it is supported only on Linux.
From readme: "For Windows users, we recommend running the server on a Linux virtual machine." but it would be great to do without this workaround.

Additionally, it might be a good idea to mention this issue on the main Github page of TRAINS.

Thanks for the great project!

Failed to create credentials on linux

I tried to install trains-server on Ubuntu 20.04.
I checked the port with sudo lsof -Pn -i4 | grep :8080 | grep LISTEN and nothing was found.
I followed the instructions (https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md) step by step (except for step 8, where I used my username instead of 1000).

trains_log.txt - sudo docker-compose -f docker-compose.yml up

trains_log1.txt - sudo HOSTIP=172.17.0.1 docker-compose -f docker-compose.yml up

There are multiple errors in the logs.
Also, I can't create credentials on localhost:8080.

how to sense if trains is currently tracking experiment ?

The trains package may be imported by the current process whether or not an experiment is actually being tracked.
In that case, some external source code may want to change its behaviour when the current run is not tracked (for example, do something other than an API call).
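
One way the check described above might look is sketched here. It assumes the trains `Task.current_task()` class method, which returns the task of the running process or `None` when nothing has been initialised; the `current_task_or_none` wrapper is a hypothetical helper, and the import guard only covers the case where trains is not installed at all.

```python
def current_task_or_none():
    """Return the active trains Task, or None if the run is not tracked."""
    try:
        # Assumption: the trains package is the client in use
        from trains import Task
    except ImportError:
        # trains is not installed, so nothing can be tracking this run
        return None
    return Task.current_task()


def report(message):
    """Send a message to the tracked task's logger if there is one, else print it."""
    task = current_task_or_none()
    if task is None:
        print(message)
        return "stdout"
    task.get_logger().report_text(message)
    return "trains"
```

Library code can then call `report(...)` without caring whether the surrounding script ever called `Task.init`.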

Min / Max Values are not displaying on dropdown menu

A minor UI bug:

Select Min Values or Max Values in the drop-down menu on the Scalars tab in the Compare Experiment page (demo server as an example)
The comparison tables display values as expected, however the dropdown menu still says 'Last Values', which is a bit confusing.

Cannot create Credentials

I run trains-server on GCP with docker-compose up and port-forward it to my local machine.
When I open the http://localhost:8080/profile page and click Create new credentials, an error occurs.

how to ssh login to AMI instance?

I'm trying to use your AMI version of trains server, and it looks well accessible via web, but not via ssh.
What's the default user name of the image (both of the community version and the AWS marketplace version)?

Sub-domain configuration and authentication

Hi
I used the AMI to deploy to AWS, but I can see that everyone can log in.
Is there a guide for authentication?
Also, in the AMI deployment, where exactly is the trains-init script? I could not find it, not even in the docker instances (webserver/api).
thanks
Shlomi

preserve plots layout on refresh

When refreshing plots page, the graph's view resets. It would be very convenient if the plot perspective is retained (e.g., zoom level, min/max XY axis view, etc.)

How to run trains on local server.

I have installed trains-server on my Ubuntu machine using this link (https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md). I am currently running a simple example, https://github.com/allegroai/trains/blob/master/examples/automl/toy_base_task.py. I am not sure how to run trains-init or where to find and edit ~/trains.conf so that I can run things on the local server. Can anyone describe it briefly?
Thanks in advance

Help - can't pass the login page

Hi, I would really appreciate any help :)

I had a running trains server that was deployed using docker-compose on a linux machine.
It worked nicely just the other day, and suddenly I can't log in to the application (can't pass the type and START).
I tried to reinstall it, but nothing helps. Any idea where the issue could be?
I can't find anything in apiserver.log.

leaderboard issues

Hey!
In order to create a leaderboard we need 2 things:

Sorting experiments by scalars added to the table (e.g. auc). Sorting works on other columns such as date and iteration, but not on added scalars.
The scalar values shown must be the best values, not the last ones.
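
The difference between ranking by best value and ranking by last value can be sketched in a few lines of Python. This is only an illustration of the requested behaviour, not how the server implements it; the function name `leaderboard` and the sample experiment names are hypothetical.

```python
from typing import Dict, List, Tuple


def leaderboard(
    scalars: Dict[str, List[float]], higher_is_better: bool = True
) -> List[Tuple[str, float, float]]:
    """Return (experiment, best_value, last_value) rows sorted by best value.

    `scalars` maps a (hypothetical) experiment name to the reported values
    of one scalar, in iteration order.
    """
    pick = max if higher_is_better else min
    rows = [
        (name, pick(values), values[-1])
        for name, values in scalars.items()
        if values  # skip experiments that reported nothing
    ]
    # Sort on the best value, not the last one.
    rows.sort(key=lambda row: row[1], reverse=higher_is_better)
    return rows


if __name__ == "__main__":
    history = {
        "exp_a": [0.70, 0.91, 0.85],  # peaked early, then dropped
        "exp_b": [0.60, 0.80, 0.88],  # still improving
    }
    for name, best, last in leaderboard(history):
        print(f"{name}: best={best} last={last}")
```

With this sample data, sorting by the last value would put exp_b first, while sorting by the best value puts exp_a first.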

`Clear` button

Feature request

Hi! I would love to see a clear UI button next to the edit button to easily clear the installed python package list before enqueuing an experiment.

Use case

I install forked packages with Docker with a

git+git://github.com/MyPublicRepo/albumentations@master#egg=albumentations
git+https://${GITHUB_TOKEN}:@github.com/MySecretRepo/numpy@master#egg=numpy

line in requirements.txt. If I don't clear the package list, trains-agent will install the official package from pypi. Right now I click edit, select all packages and clear them manually. Not much effort but could be better :)

Alternative solution

Add the source URL for packages not installed from PyPI, such as the exact GitHub link.
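
As a rough sketch of the alternative, the source of each requirement could be derived from the requirements.txt line itself. The helper below is hypothetical (it is not part of trains or trains-agent) and assumes that any line without a VCS prefix comes from the default package index, labelled "pypi" here.

```python
import re
from typing import Optional, Tuple

# URL prefixes pip uses for version-control requirements.
VCS_PREFIXES = ("git+", "hg+", "svn+", "bzr+")


def parse_requirement(line: str) -> Tuple[Optional[str], str]:
    """Return (package_name, source) for one requirements.txt line.

    For VCS lines the source is the URL without the "#egg=..." fragment.
    For everything else the source is assumed to be "pypi".
    """
    line = line.strip()
    if line.startswith(VCS_PREFIXES):
        url, _, fragment = line.partition("#")
        name = fragment[len("egg="):] if fragment.startswith("egg=") else None
        return name, url
    # Plain requirement such as "numpy==1.18.1": cut at the first specifier.
    name = re.split(r"[<>=!~\[; ]", line, maxsplit=1)[0]
    return name, "pypi"


if __name__ == "__main__":
    for req in (
        "git+git://github.com/MyPublicRepo/albumentations@master#egg=albumentations",
        "numpy==1.18.1",
    ):
        print(parse_requirement(req))
```

Note that a URL containing a token placeholder such as ${GITHUB_TOKEN} would need to be masked before being displayed anywhere.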

Uploading checkpoints models to server

Is it possible to upload model checkpoints to the trains-server instance?
The idea is to not be dependent on the instances as they might come and go and their storage could be erased, but to use the server as the static element in the chain.

As of now, it seems like models are stored on the agents only.
I also was unable to download any of them directly through the web interface.

Docker open ports

Hi,

I launched trains-server using the docker installation on my dedicated server running Ubuntu 18.04. The next day I noticed an xmr miner in the redis container. I solved this problem by managing iptables.
Is there another way to secure the installation?
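
One way to verify that firewall rules are effective is to probe the service ports from a machine outside the server. The sketch below is an assumption-laden illustration: the port numbers are the container-internal defaults for redis, mongo and elasticsearch, and whether they are published on the host depends on the docker-compose file in use. The host address is a placeholder.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name resolution failed.
        return False


if __name__ == "__main__":
    # Assumed default ports; adjust to match your docker-compose file.
    service_ports = {"redis": 6379, "mongo": 27017, "elasticsearch": 9200}
    host = "127.0.0.1"  # placeholder: use the server's public address
    for name, port in service_ports.items():
        state = "OPEN" if port_is_open(host, port) else "closed"
        print(f"{name:14s} {port:5d} {state}")
```

Run from an external machine, any database port reported as OPEN is reachable from the network and would need to be blocked or bound to localhost only.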

secret key issues containing *

Hey!
After several attempts, we still have issues when the secret key contains *.
Is it possible to create the secret key without *?

It affects us when we use environment variables rather than the conf file.

Thanks!
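
The issue does not say where the * gets mangled. One possible cause, and this is only an assumption, is shell glob expansion when the key is exported unquoted. A minimal sketch of producing a safely quoted export line with Python's standard shlex module follows; the variable name MY_SECRET_KEY is hypothetical.

```python
import shlex


def export_line(name: str, value: str) -> str:
    """Build a shell `export` statement with the value safely quoted."""
    return f"export {name}={shlex.quote(value)}"


if __name__ == "__main__":
    # Hypothetical variable name and key, for illustration only.
    print(export_line("MY_SECRET_KEY", "abc*def"))
```

Single quotes around the value keep the shell from treating * as a wildcard.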

Adding user password and restricting user creation

Thank you for this wonderful tool!

Is there a way to add user passwords or any other auth system?
I have changed verify_user_tokens to true in the config, but I don't see the difference.

Also, once we have created the users we want, is it possible to prevent the creation of new users?

Thank you

Deploying through docker-compose

Hi.
I am trying to deploy trains-server to a new, clean machine (Ubuntu 18.04 LTS).
First, I downloaded the source archive from releases (0.14.1),
untarred it,
and ran docker-compose up.

Docker version 19.03.8, build afacb8b7f0
docker-compose version 1.25.4, build 8d51620a

I received errors from trains-elastic

trains-elastic   | [2020-03-26T09:19:10,496][INFO ][o.e.n.Node               ] [trains] initializing ...
trains-elastic   | [2020-03-26T09:19:10,538][WARN ][o.e.b.ElasticsearchUncaughtExceptionHandler] [trains] uncaught exception in thread [main]
trains-elastic   | org.elasticsearch.bootstrap.StartupException: java.lang.IllegalStateException: Failed to create node environment
trains-elastic   | 	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:136) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.bootstrap.Elasticsearch.execute(Elasticsearch.java:123) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:70) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.cli.Command.mainWithoutErrorHandling(Command.java:134) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.cli.Command.main(Command.java:90) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:91) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:84) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | Caused by: java.lang.IllegalStateException: Failed to create node environment
trains-elastic   | 	at org.elasticsearch.node.Node.<init>(Node.java:268) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.node.Node.<init>(Node.java:245) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:233) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:233) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:342) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:132) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	... 6 more
trains-elastic   | Caused by: java.nio.file.AccessDeniedException: /usr/share/elasticsearch/data/nodes
trains-elastic   | 	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:84) ~[?:?]
trains-elastic   | 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
trains-elastic   | 	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
trains-elastic   | 	at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384) ~[?:?]
trains-elastic   | 	at java.nio.file.Files.createDirectory(Files.java:674) ~[?:1.8.0_201]
trains-elastic   | 	at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781) ~[?:1.8.0_201]
trains-elastic   | 	at java.nio.file.Files.createDirectories(Files.java:767) ~[?:1.8.0_201]
trains-elastic   | 	at org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:225) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.node.Node.<init>(Node.java:265) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.node.Node.<init>(Node.java:245) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:233) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:233) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:342) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	at org.elasticsearch.bootstrap.Elasticsearch.init(Elasticsearch.java:132) ~[elasticsearch-5.6.16.jar:5.6.16]
trains-elastic   | 	... 6 more
trains-elastic exited with code 1

and from trains-apiserver

trains-apiserver | [2020-03-26 09:19:40,206] [8] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f829a354780>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /_template/queue_metrics
trains-apiserver | Loading config from /opt/trains/server/config/default
trains-apiserver | Loading config from file /opt/trains/server/config/default/apiserver.conf
trains-apiserver | Loading config from file /opt/trains/server/config/default/hosts.conf
trains-apiserver | Loading config from file /opt/trains/server/config/default/logging.conf
trains-apiserver | Loading config from file /opt/trains/server/config/default/secure.conf
trains-apiserver | Loading config from file /opt/trains/server/config/default/services/events.conf
trains-apiserver | Loading config from file /opt/trains/server/config/default/services/tasks.conf
trains-apiserver | Loading config from /opt/trains/config
trains-apiserver | Traceback (most recent call last):
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 157, in _new_conn
trains-apiserver |     (self._dns_host, self.port), self.timeout, **extra_kw
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/util/connection.py", line 61, in create_connection
trains-apiserver |     for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
trains-apiserver |   File "/usr/lib64/python3.6/socket.py", line 745, in getaddrinfo
trains-apiserver |     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
trains-apiserver | socket.gaierror: [Errno -2] Name or service not known
trains-apiserver | 
trains-apiserver | During handling of the above exception, another exception occurred:
trains-apiserver | 
trains-apiserver | Traceback (most recent call last):
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
trains-apiserver |     chunked=chunked,
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
trains-apiserver |     conn.request(method, url, **httplib_request_kw)
trains-apiserver |   File "/usr/lib64/python3.6/http/client.py", line 1254, in request
trains-apiserver |     self._send_request(method, url, body, headers, encode_chunked)
trains-apiserver |   File "/usr/lib64/python3.6/http/client.py", line 1300, in _send_request
trains-apiserver |     self.endheaders(body, encode_chunked=encode_chunked)
trains-apiserver |   File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
trains-apiserver |     self._send_output(message_body, encode_chunked=encode_chunked)
trains-apiserver |   File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
trains-apiserver |     self.send(msg)
trains-apiserver |   File "/usr/lib64/python3.6/http/client.py", line 974, in send
trains-apiserver |     self.connect()
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 184, in connect
trains-apiserver |     conn = self._new_conn()
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 169, in _new_conn
trains-apiserver |     self, "Failed to establish a new connection: %s" % e
trains-apiserver | urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f829a32f7f0>: Failed to establish a new connection: [Errno -2] Name or service not known
trains-apiserver | 
trains-apiserver | During handling of the above exception, another exception occurred:
trains-apiserver | 
trains-apiserver | Traceback (most recent call last):
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
trains-apiserver |     timeout=timeout
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 760, in urlopen
trains-apiserver |     **response_kw
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 760, in urlopen
trains-apiserver |     **response_kw
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 760, in urlopen
trains-apiserver |     **response_kw
trains-apiserver |   [Previous line repeated 2 more times]
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 720, in urlopen
trains-apiserver |     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 436, in increment
trains-apiserver |     raise MaxRetryError(_pool, url, error or ResponseError(cause))
trains-apiserver | urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='elasticsearch', port=9200): Max retries exceeded with url: /_template/queue_metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f829a32f7f0>: Failed to establish a new connection: [Errno -2] Name or service not known',))
trains-apiserver | 
trains-apiserver | During handling of the above exception, another exception occurred:
trains-apiserver | 
trains-apiserver | Traceback (most recent call last):
trains-apiserver |   File "server.py", line 36, in <module>
trains-apiserver |     init_es_data()
trains-apiserver |   File "/opt/trains/server/elastic/initialize.py", line 26, in init_es_data
trains-apiserver |     res = apply_mappings_to_host(host)
trains-apiserver |   File "/opt/trains/server/elastic/apply_mappings.py", line 37, in apply_mappings_to_host
trains-apiserver |     _send_mapping(f) for f in p.iterdir() if f.is_file() and f.suffix == ".json"
trains-apiserver |   File "/opt/trains/server/elastic/apply_mappings.py", line 37, in <listcomp>
trains-apiserver |     _send_mapping(f) for f in p.iterdir() if f.is_file() and f.suffix == ".json"
trains-apiserver |   File "/opt/trains/server/elastic/apply_mappings.py", line 27, in _send_mapping
trains-apiserver |     session.delete(url)
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 612, in delete
trains-apiserver |     return self.request('DELETE', url, **kwargs)
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
trains-apiserver |     resp = self.send(prep, **send_kwargs)
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
trains-apiserver |     r = adapter.send(request, **kwargs)
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
trains-apiserver |     raise ConnectionError(e, request=request)
trains-apiserver | requests.exceptions.ConnectionError: HTTPConnectionPool(host='elasticsearch', port=9200): Max retries exceeded with url: /_template/queue_metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f829a32f7f0>: Failed to establish a new connection: [Errno -2] Name or service not known',))
trains-apiserver exited with code 1
trains-apiserver | [2020-03-26 09:19:41,625] [8] [INFO] [trains.es_factory] Using override elastic host elasticsearch
trains-apiserver | [2020-03-26 09:19:41,625] [8] [INFO] [trains.es_factory] Using override elastic port 9200
trains-apiserver | [2020-03-26 09:19:41,870] [8] [INFO] [trains.redis_manager] Using override redis host redis
trains-apiserver | [2020-03-26 09:19:41,871] [8] [INFO] [trains.redis_manager] Using override redis port 6379
trains-apiserver | [2020-03-26 09:19:41,975] [8] [INFO] [trains.schema] loading schema from cache
trains-apiserver | [2020-03-26 09:19:41,994] [8] [INFO] [trains.server] ################ API Server initializing #####################
trains-apiserver | [2020-03-26 09:19:41,994] [8] [INFO] [trains.database] Initializing database connections
trains-apiserver | [2020-03-26 09:19:41,995] [8] [INFO] [trains.database] Using override mongodb host mongo
trains-apiserver | [2020-03-26 09:19:41,995] [8] [INFO] [trains.database] Using override mongodb port 27017
trains-apiserver | [2020-03-26 09:19:41,997] [8] [INFO] [trains.database] Registering connection to auth-db (mongodb://mongo:27017/auth)
trains-apiserver | [2020-03-26 09:19:41,998] [8] [INFO] [trains.database] Registering connection to backend-db (mongodb://mongo:27017/backend)
trains-apiserver | [2020-03-26 09:19:41,999] [8] [INFO] [trains.initialize] Applying mappings to host: http://elasticsearch:9200
trains-apiserver | [2020-03-26 09:19:42,016] [8] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0afda9eeb8>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /_template/queue_metrics
trains-apiserver | [2020-03-26 09:19:43,031] [8] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0afdaae160>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /_template/queue_metrics
trains-apiserver | [2020-03-26 09:19:45,049] [8] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0afdaae2e8>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /_template/queue_metrics
trains-apiserver | [2020-03-26 09:19:49,071] [8] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0afdaae470>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /_template/queue_metrics
trains-apiserver | [2020-03-26 09:19:57,094] [8] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0afdaae7f0>: Failed to establish a new connection: [Errno -2] Name or service not known',)': /_template/queue_metrics
trains-apiserver | Loading config from /opt/trains/server/config/default
trains-apiserver | Loading config from file /opt/trains/server/config/default/apiserver.conf
trains-apiserver | Loading config from file /opt/trains/server/config/default/hosts.conf
trains-apiserver | Loading config from file /opt/trains/server/config/default/logging.conf
trains-apiserver | Loading config from file /opt/trains/server/config/default/secure.conf
trains-apiserver | Loading config from file /opt/trains/server/config/default/services/events.conf
trains-apiserver | Loading config from file /opt/trains/server/config/default/services/tasks.conf
trains-apiserver | Loading config from /opt/trains/config
trains-apiserver | Traceback (most recent call last):
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 157, in _new_conn
trains-apiserver |     (self._dns_host, self.port), self.timeout, **extra_kw
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/util/connection.py", line 61, in create_connection
trains-apiserver |     for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
trains-apiserver |   File "/usr/lib64/python3.6/socket.py", line 745, in getaddrinfo
trains-apiserver |     for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
trains-apiserver | socket.gaierror: [Errno -2] Name or service not known
trains-apiserver | 
trains-apiserver | During handling of the above exception, another exception occurred:
trains-apiserver | 
trains-apiserver | Traceback (most recent call last):
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
trains-apiserver |     chunked=chunked,
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 387, in _make_request
trains-apiserver |     conn.request(method, url, **httplib_request_kw)
trains-apiserver |   File "/usr/lib64/python3.6/http/client.py", line 1254, in request
trains-apiserver |     self._send_request(method, url, body, headers, encode_chunked)
trains-apiserver |   File "/usr/lib64/python3.6/http/client.py", line 1300, in _send_request
trains-apiserver |     self.endheaders(body, encode_chunked=encode_chunked)
trains-apiserver |   File "/usr/lib64/python3.6/http/client.py", line 1249, in endheaders
trains-apiserver |     self._send_output(message_body, encode_chunked=encode_chunked)
trains-apiserver |   File "/usr/lib64/python3.6/http/client.py", line 1036, in _send_output
trains-apiserver |     self.send(msg)
trains-apiserver |   File "/usr/lib64/python3.6/http/client.py", line 974, in send
trains-apiserver |     self.connect()
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 184, in connect
trains-apiserver |     conn = self._new_conn()
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connection.py", line 169, in _new_conn
trains-apiserver |     self, "Failed to establish a new connection: %s" % e
trains-apiserver | urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f0afda88860>: Failed to establish a new connection: [Errno -2] Name or service not known
trains-apiserver | 
trains-apiserver | During handling of the above exception, another exception occurred:
trains-apiserver | 
trains-apiserver | Traceback (most recent call last):
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
trains-apiserver |     timeout=timeout
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 760, in urlopen
trains-apiserver |     **response_kw
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 760, in urlopen
trains-apiserver |     **response_kw
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 760, in urlopen
trains-apiserver |     **response_kw
trains-apiserver |   [Previous line repeated 2 more times]
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 720, in urlopen
trains-apiserver |     method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/urllib3/util/retry.py", line 436, in increment
trains-apiserver |     raise MaxRetryError(_pool, url, error or ResponseError(cause))
trains-apiserver | urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='elasticsearch', port=9200): Max retries exceeded with url: /_template/queue_metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0afda88860>: Failed to establish a new connection: [Errno -2] Name or service not known',))
trains-apiserver | 
trains-apiserver | During handling of the above exception, another exception occurred:
trains-apiserver | 
trains-apiserver | Traceback (most recent call last):
trains-apiserver |   File "server.py", line 36, in <module>
trains-apiserver |     init_es_data()
trains-apiserver |   File "/opt/trains/server/elastic/initialize.py", line 26, in init_es_data
trains-apiserver |     res = apply_mappings_to_host(host)
trains-apiserver |   File "/opt/trains/server/elastic/apply_mappings.py", line 37, in apply_mappings_to_host
trains-apiserver |     _send_mapping(f) for f in p.iterdir() if f.is_file() and f.suffix == ".json"
trains-apiserver |   File "/opt/trains/server/elastic/apply_mappings.py", line 37, in <listcomp>
trains-apiserver |     _send_mapping(f) for f in p.iterdir() if f.is_file() and f.suffix == ".json"
trains-apiserver |   File "/opt/trains/server/elastic/apply_mappings.py", line 27, in _send_mapping
trains-apiserver |     session.delete(url)
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 612, in delete
trains-apiserver |     return self.request('DELETE', url, **kwargs)
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 530, in request
trains-apiserver |     resp = self.send(prep, **send_kwargs)
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/requests/sessions.py", line 643, in send
trains-apiserver |     r = adapter.send(request, **kwargs)
trains-apiserver |   File "/usr/local/lib/python3.6/site-packages/requests/adapters.py", line 516, in send
trains-apiserver |     raise ConnectionError(e, request=request)
trains-apiserver | requests.exceptions.ConnectionError: HTTPConnectionPool(host='elasticsearch', port=9200): Max retries exceeded with url: /_template/queue_metrics (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f0afda88860>: Failed to establish a new connection: [Errno -2] Name or service not known',))
trains-apiserver exited with code 1

Hm, I said, and went to Google. There was nothing very useful about it, but one suggestion was to chown the trains directory. I tried that.
The logs from trains-elastic changed

trains-elastic   | [2020-03-26T09:23:45,378][INFO ][o.e.p.PluginsService     ] [trains] loaded module [aggs-matrix-stats]
trains-elastic   | [2020-03-26T09:23:45,379][INFO ][o.e.p.PluginsService     ] [trains] loaded module [ingest-common]
trains-elastic   | [2020-03-26T09:23:45,379][INFO ][o.e.p.PluginsService     ] [trains] loaded module [lang-expression]
trains-elastic   | [2020-03-26T09:23:45,379][INFO ][o.e.p.PluginsService     ] [trains] loaded module [lang-groovy]
trains-elastic   | [2020-03-26T09:23:45,380][INFO ][o.e.p.PluginsService     ] [trains] loaded module [lang-mustache]
trains-elastic   | [2020-03-26T09:23:45,380][INFO ][o.e.p.PluginsService     ] [trains] loaded module [lang-painless]
trains-elastic   | [2020-03-26T09:23:45,380][INFO ][o.e.p.PluginsService     ] [trains] loaded module [parent-join]
trains-elastic   | [2020-03-26T09:23:45,381][INFO ][o.e.p.PluginsService     ] [trains] loaded module [percolator]
trains-elastic   | [2020-03-26T09:23:45,381][INFO ][o.e.p.PluginsService     ] [trains] loaded module [reindex]
trains-elastic   | [2020-03-26T09:23:45,381][INFO ][o.e.p.PluginsService     ] [trains] loaded module [transport-netty3]
trains-elastic   | [2020-03-26T09:23:45,382][INFO ][o.e.p.PluginsService     ] [trains] loaded module [transport-netty4]
trains-elastic   | [2020-03-26T09:23:45,382][INFO ][o.e.p.PluginsService     ] [trains] loaded plugin [ingest-geoip]
trains-elastic   | [2020-03-26T09:23:45,389][INFO ][o.e.p.PluginsService     ] [trains] loaded plugin [ingest-user-agent]
trains-elastic   | [2020-03-26T09:23:45,390][INFO ][o.e.p.PluginsService     ] [trains] loaded plugin [x-pack]
trains-apiserver | [2020-03-26 09:23:46,247] [8] [INFO] [trains.es_factory] Using override elastic host elasticsearch
trains-apiserver | [2020-03-26 09:23:46,248] [8] [INFO] [trains.es_factory] Using override elastic port 9200
trains-apiserver | [2020-03-26 09:23:46,742] [8] [INFO] [trains.redis_manager] Using override redis host redis
trains-apiserver | [2020-03-26 09:23:46,743] [8] [INFO] [trains.redis_manager] Using override redis port 6379
trains-apiserver | [2020-03-26 09:23:46,968] [8] [INFO] [trains.schema] loading schema from cache
trains-apiserver | [2020-03-26 09:23:47,004] [8] [INFO] [trains.server] ################ API Server initializing #####################
trains-apiserver | [2020-03-26 09:23:47,010] [8] [INFO] [trains.database] Initializing database connections
trains-apiserver | [2020-03-26 09:23:47,010] [8] [INFO] [trains.database] Using override mongodb host mongo
trains-apiserver | [2020-03-26 09:23:47,011] [8] [INFO] [trains.database] Using override mongodb port 27017
trains-apiserver | [2020-03-26 09:23:47,012] [8] [INFO] [trains.database] Registering connection to auth-db (mongodb://mongo:27017/auth)
trains-apiserver | [2020-03-26 09:23:47,014] [8] [INFO] [trains.database] Registering connection to backend-db (mongodb://mongo:27017/backend)
trains-apiserver | [2020-03-26 09:23:47,018] [8] [INFO] [trains.initialize] Applying mappings to host: http://elasticsearch:9200
trains-apiserver | [2020-03-26 09:23:47,026] [8] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6f66590eb8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /_template/queue_metrics
trains-elastic   | [2020-03-26T09:23:47,252][WARN ][o.e.d.c.s.Settings       ] [script.inline] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.
trains-elastic   | [2020-03-26T09:23:47,253][WARN ][o.e.d.c.s.Settings       ] [script.update] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version.
trains-apiserver | [2020-03-26 09:23:48,028] [8] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6f6659e160>: Failed to establish a new connection: [Errno 111] Connection refused',)': /_template/queue_metrics
trains-elastic   | [2020-03-26T09:23:49,699][INFO ][o.e.x.m.j.p.l.CppLogMessageHandler] [controller/45] [Main.cc@128] controller (64 bit): Version 5.6.16 (Build 9ed4c28f2a8755) Copyright (c) 2019 Elasticsearch BV
trains-elastic   | [2020-03-26T09:23:49,771][INFO ][o.e.d.DiscoveryModule    ] [trains] using discovery type [zen]
trains-apiserver | [2020-03-26 09:23:50,032] [8] [WARNING] [urllib3.connectionpool] Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f6f6659e2e8>: Failed to establish a new connection: [Errno 111] Connection refused',)': /_template/queue_metrics
trains-elastic   | [2020-03-26T09:23:51,016][INFO ][o.e.n.Node               ] [trains] initialized
trains-elastic   | [2020-03-26T09:23:51,020][INFO ][o.e.n.Node               ] [trains] starting ...
trains-elastic   | [2020-03-26T09:23:51,383][INFO ][o.e.t.TransportService   ] [trains] publish_address {172.24.0.5:9300}, bound_addresses {0.0.0.0:9300}
trains-elastic   | [2020-03-26T09:23:51,407][INFO ][o.e.b.BootstrapChecks    ] [trains] bound or publishing to a non-loopback address, enforcing bootstrap checks
trains-elastic   | ERROR: [1] bootstrap checks failed
trains-elastic   | [1]: max virtual memory areas vm.max_map_count [65530] is too low, increase to at least [262144]
trains-elastic   | [2020-03-26T09:23:51,437][INFO ][o.e.n.Node               ] [trains] stopping ...
trains-elastic   | [2020-03-26T09:23:51,470][INFO ][o.e.n.Node               ] [trains] stopped
trains-elastic   | [2020-03-26T09:23:51,471][INFO ][o.e.n.Node               ] [trains] closing ...
trains-elastic   | [2020-03-26T09:23:51,491][INFO ][o.e.n.Node               ] [trains] closed
trains-elastic exited with code 78

but the logs from apiserver did not.

At this point I ran out of ideas. Can you help me?

apiserver.log
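
The second trains-elastic log above ends with a bootstrap check failure: vm.max_map_count is 65530 on the host, while Elasticsearch asks for at least 262144. The sketch below pulls those two numbers out of the log line and compares them with the current host setting. The function names are hypothetical, and reading /proc/sys/vm/max_map_count assumes a Linux host. On Linux this value is commonly raised by running sysctl -w vm.max_map_count=262144 as root on the host, not inside the container.

```python
import re
from typing import Optional, Tuple

# Matches the Elasticsearch bootstrap check message seen in the log above.
BOOTSTRAP_RE = re.compile(
    r"max virtual memory areas vm\.max_map_count \[(\d+)\] is too low, "
    r"increase to at least \[(\d+)\]"
)


def parse_max_map_count_error(log_line: str) -> Optional[Tuple[int, int]]:
    """Return (current, required) if the line is the bootstrap failure."""
    match = BOOTSTRAP_RE.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def read_max_map_count(path: str = "/proc/sys/vm/max_map_count") -> int:
    """Read the current kernel setting (Linux only)."""
    with open(path) as handle:
        return int(handle.read().strip())


if __name__ == "__main__":
    line = (
        "trains-elastic   | [1]: max virtual memory areas vm.max_map_count "
        "[65530] is too low, increase to at least [262144]"
    )
    parsed = parse_max_map_count_error(line)
    if parsed is not None:
        _, required = parsed
        current = read_max_map_count()
        verdict = "ok" if current >= required else "too low"
        print(f"host vm.max_map_count={current}, required={required}: {verdict}")
```

Since the apiserver errors are connection failures to elasticsearch:9200, they would be expected to persist for as long as the Elasticsearch container keeps exiting.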