There are two ways I tried to reproduce my error. Launching a Notebook and running the sample notebook from the Web UI works fine, but submitting the MNIST example code from the Quick Start Guide via the CLI ends with the experiment ERRORED.
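For the CLI path I just followed the Quick Start steps, roughly as below (I'm assuming the bundled mnist_pytorch tutorial directory; the path may differ per release, and the master address is the one from the logs):

    cd examples/tutorials/mnist_pytorch
    det -m http://192.168.245.182:8080 experiment create const.yaml .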
After running these commands, a new experiment appears on the WebUI dashboard; several minutes later its status changes from ACTIVE to ERRORED.
You may need to run the following command in the Notebook terminal to avoid "ImportError: IntProgress not found.":
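    pip install ipywidgets

(I assume this is the command meant here; it is the fix the ipywidgets install docs give for that ImportError.) Master logs for the failing experiment (experiment 4):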
<info> [2021-05-06, 09:04:58] master configuration: {"config_file":"","log":{"level":"info","color":true},"db":{"user":"postgres","password":"********","migrations":"file:///usr/share/determined/master/static/migrations","host":"determined-db-service-determined-ai","port":"5432","name":"determined","ssl_mode":"disable","ssl_root_cert":""},"tensorboard_timeout":300,"security":{"default_task":{"id":0,"user_id":0,"user":"root","uid":0,"group":"root","gid":0},"tls":{"cert":"","key":""}},"checkpoint_storage":{"host_path":"/checkpoints","save_experiment_best":0,"save_trial_best":1,"save_trial_latest":1,"type":"shared_fs"},"task_container_defaults":{"shm_size_bytes":4294967296,"network_mode":"bridge","cpu_pod_spec":null,"gpu_pod_spec":null,"add_capabilities":null,"drop_capabilities":null,"devices":null},"port":8081,"harness_path":"/opt/determined","root":"/usr/share/determined/master","telemetry":{"enabled":true,"segment_master_key":"********","segment_webui_key":"********"},"enable_cors":false,"cluster_name":"","logging":{"type":"default"},"hyperparameter_importance":{"workers_limit":2,"queue_limit":16,"cores_per_worker":1,"max_trees":100},"resource_manager":{"default_scheduler":"","leave_kubernetes_resources":false,"master_service_name":"determined-master-service-determined-ai","max_slots_per_pod":2,"namespace":"determined","type":"kubernetes"},"resource_pools":null}
<info> [2021-05-06, 09:04:58] Determined master 0.15.3 (built with go1.16.3)
<info> [2021-05-06, 09:04:58] connecting to database determined-db-service-determined-ai:5432
<info> [2021-05-06, 09:04:58] running migrations from file:///usr/share/determined/master/static/migrations
<info> [2021-05-06, 09:04:58] found golang-migrate version 20210322160616
<info> [2021-05-06, 09:04:58] deleting all snapshots for terminal state experiments
<info> [2021-05-06, 09:04:58] initializing endpoints for pods
<info> [2021-05-06, 09:04:58] kubernetes clientSet initialized id="pods" system="master" type="pods"
<info> [2021-05-06, 09:04:58] scheduling next resource allocation aggregation in 14h56m1s at 2021-05-07 00:01:00 +0000 UTC id="allocation-aggregator" system="master" type="allocationAggregator"
<info> [2021-05-06, 09:04:58] master URL set to 192.168.245.182:8080 id="pods" system="master" type="pods"
<info> [2021-05-06, 09:04:58] telemetry reporting is enabled; run with `--telemetry-enabled=false` to disable
<info> [2021-05-06, 09:04:58] accepting incoming connections on port 8081
<info> [2021-05-06, 09:04:58] Subchannel Connectivity change to READY system="system"
<info> [2021-05-06, 09:04:58] pickfirstBalancer: HandleSubConnStateChange: 0xc000556340, {READY <nil>} system="system"
<info> [2021-05-06, 09:04:58] Channel Connectivity change to READY system="system"
<info> [2021-05-06, 09:04:58] event listener is starting id="event-listener" system="master" type="eventListener"
<info> [2021-05-06, 09:04:58] pod informer is starting id="pod-informer" system="master" type="informer"
<info> [2021-05-06, 09:04:58] preemption listener is starting id="preemption-listener" system="master" type="preemptionListener"
<info> [2021-05-06, 09:04:58] node informer has started id="node-informer" system="master" type="nodeInformer"
<info> [2021-05-06, 11:25:56] experiment state changed to ACTIVE id="4" system="master" type="experiment"
<info> [2021-05-06, 11:25:56] resources are requested by /experiments/4/b9b4e788-0d16-463d-988e-732645112365 (Task ID: bcccb550-1c39-423b-8353-304fc5dccf72) id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:25:56] resources assigned with 1 pods id="kubernetesRM" system="master" task-handler="/experiments/4/b9b4e788-0d16-463d-988e-732645112365" task-id="bcccb550-1c39-423b-8353-304fc5dccf72" type="kubernetesResourceManager"
<info> [2021-05-06, 11:25:56] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (4,4,1)> experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:25:56] registering pod handler handler="/pods/pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" id="pods" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pods"
<info> [2021-05-06, 11:25:56] created configMap exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish handler="/pods/pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" id="kubernetes-worker-3" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:25:56] created pod exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish handler="/pods/pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" id="kubernetes-worker-3" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:25:56] transitioning pod state from ASSIGNED to PULLING id="pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pod"
<info> [2021-05-06, 11:25:56] transitioning pod state from PULLING to STARTING id="pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pod"
<info> [2021-05-06, 11:26:00] transitioning pod state from STARTING to RUNNING id="pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pod"
<info> [2021-05-06, 11:26:00] found container running: eac146c4-1b8d-402b-a533-6ed3c3ff0845 (rank 0) experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:00] pushing rendezvous information experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:00] found not all containers are connected experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:04] new connection from container eac146c4-1b8d-402b-a533-6ed3c3ff0845 trial 4 (experiment 4) at 100.82.21.4:46300
<info> [2021-05-06, 11:26:04] pushing rendezvous information experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:04] found all containers are connected successfully experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:26] transitioning pod state from RUNNING to TERMINATED id="pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pod"
<info> [2021-05-06, 11:26:26] pod failed with exit code: 1 id="pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pod"
<info> [2021-05-06, 11:26:26] found container terminated: eac146c4-1b8d-402b-a533-6ed3c3ff0845 experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:26] forcibly terminating trial experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<error> [2021-05-06, 11:26:26] unexpected failure of trial after restart 0/5: container failed with non-zero exit code: (exit code 1) experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:26] resetting trial 4 experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:26] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365 id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:26:27] requesting to delete kubernetes resources id="pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pod"
<info> [2021-05-06, 11:26:27] de-registering pod handler handler="/pods/pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" id="pods" pod="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pods"
<info> [2021-05-06, 11:26:27] deleted pod exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish handler="/pods/pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" id="kubernetes-worker-4" system="master" type="requestProcessingWorker"
<warning> [2021-05-06, 11:26:27] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pods"
<info> [2021-05-06, 11:26:27] deleted configMap exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish handler="/pods/pod-eac146c4-1b8d-402b-a533-6ed3c3ff0845" id="kubernetes-worker-4" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:26:27] resources are requested by /experiments/4/b9b4e788-0d16-463d-988e-732645112365 (Task ID: 37fda087-fb9f-4ece-ac40-3fd48dd20b05) id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:26:27] resources assigned with 1 pods id="kubernetesRM" system="master" task-handler="/experiments/4/b9b4e788-0d16-463d-988e-732645112365" task-id="37fda087-fb9f-4ece-ac40-3fd48dd20b05" type="kubernetesResourceManager"
<info> [2021-05-06, 11:26:27] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (4,4,1)> experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:27] registering pod handler handler="/pods/pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" id="pods" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pods"
<info> [2021-05-06, 11:26:27] created configMap exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck handler="/pods/pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" id="kubernetes-worker-0" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:26:27] created pod exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck handler="/pods/pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" id="kubernetes-worker-0" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:26:27] transitioning pod state from ASSIGNED to PULLING id="pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pod"
<info> [2021-05-06, 11:26:27] transitioning pod state from PULLING to STARTING id="pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pod"
<warning> [2021-05-06, 11:26:33] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pods"
<info> [2021-05-06, 11:26:33] transitioning pod state from STARTING to RUNNING id="pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pod"
<info> [2021-05-06, 11:26:33] found container running: 407621fe-2e52-43f3-a4d0-9058cf2630ed (rank 0) experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:33] pushing rendezvous information experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:33] found not all containers are connected experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<warning> [2021-05-06, 11:26:34] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pods"
<warning> [2021-05-06, 11:26:34] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-bcccb550-1c39-423b-8353-304fc5dccf72-tight-sailfish" system="master" type="pods"
<info> [2021-05-06, 11:26:36] new connection from container 407621fe-2e52-43f3-a4d0-9058cf2630ed trial 4 (experiment 4) at 100.82.21.5:44404
<info> [2021-05-06, 11:26:36] pushing rendezvous information experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:36] found all containers are connected successfully experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:48] transitioning pod state from RUNNING to TERMINATED id="pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pod"
<info> [2021-05-06, 11:26:48] pod failed with exit code: 1 id="pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pod"
<info> [2021-05-06, 11:26:48] found container terminated: 407621fe-2e52-43f3-a4d0-9058cf2630ed experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:48] forcibly terminating trial experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:48] requesting to delete kubernetes resources id="pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pod"
<info> [2021-05-06, 11:26:48] de-registering pod handler handler="/pods/pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" id="pods" pod="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pods"
<info> [2021-05-06, 11:26:48] deleted pod exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck handler="/pods/pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" id="kubernetes-worker-1" system="master" type="requestProcessingWorker"
<warning> [2021-05-06, 11:26:48] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pods"
<info> [2021-05-06, 11:26:48] deleted configMap exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck handler="/pods/pod-407621fe-2e52-43f3-a4d0-9058cf2630ed" id="kubernetes-worker-1" system="master" type="requestProcessingWorker"
<error> [2021-05-06, 11:26:48] unexpected failure of trial after restart 1/5: container failed with non-zero exit code: (exit code 1) experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:48] resetting trial 4 experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:48] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365 id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:26:48] resources are requested by /experiments/4/b9b4e788-0d16-463d-988e-732645112365 (Task ID: d6bfa1db-15fe-4f01-b699-41f60c327e33) id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:26:49] resources assigned with 1 pods id="kubernetesRM" system="master" task-handler="/experiments/4/b9b4e788-0d16-463d-988e-732645112365" task-id="d6bfa1db-15fe-4f01-b699-41f60c327e33" type="kubernetesResourceManager"
<info> [2021-05-06, 11:26:49] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (4,4,1)> experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:49] registering pod handler handler="/pods/pod-4db3595f-1526-4dde-9385-dc7de6860d56" id="pods" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pods"
<info> [2021-05-06, 11:26:49] created configMap exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe handler="/pods/pod-4db3595f-1526-4dde-9385-dc7de6860d56" id="kubernetes-worker-2" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:26:49] created pod exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe handler="/pods/pod-4db3595f-1526-4dde-9385-dc7de6860d56" id="kubernetes-worker-2" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:26:49] transitioning pod state from ASSIGNED to PULLING id="pod-4db3595f-1526-4dde-9385-dc7de6860d56" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pod"
<info> [2021-05-06, 11:26:49] transitioning pod state from PULLING to STARTING id="pod-4db3595f-1526-4dde-9385-dc7de6860d56" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pod"
<warning> [2021-05-06, 11:26:55] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pods"
<info> [2021-05-06, 11:26:55] transitioning pod state from STARTING to RUNNING id="pod-4db3595f-1526-4dde-9385-dc7de6860d56" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pod"
<info> [2021-05-06, 11:26:55] found container running: 4db3595f-1526-4dde-9385-dc7de6860d56 (rank 0) experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:55] pushing rendezvous information experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:55] found not all containers are connected experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:58] new connection from container 4db3595f-1526-4dde-9385-dc7de6860d56 trial 4 (experiment 4) at 100.82.21.8:57194
<info> [2021-05-06, 11:26:58] pushing rendezvous information experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:26:58] found all containers are connected successfully experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<warning> [2021-05-06, 11:27:02] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pods"
<warning> [2021-05-06, 11:27:02] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-37fda087-fb9f-4ece-ac40-3fd48dd20b05-credible-buck" system="master" type="pods"
<warning> [2021-05-06, 11:27:34] preemption listener stopped unexpectedly id="preemption-listener" system="master" type="preemptionListener"
<info> [2021-05-06, 11:27:34] preemption listener is starting id="preemption-listener" system="master" type="preemptionListener"
<info> [2021-05-06, 11:27:49] transitioning pod state from RUNNING to TERMINATED id="pod-4db3595f-1526-4dde-9385-dc7de6860d56" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pod"
<info> [2021-05-06, 11:27:49] pod failed with exit code: 1 id="pod-4db3595f-1526-4dde-9385-dc7de6860d56" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pod"
<info> [2021-05-06, 11:27:49] found container terminated: 4db3595f-1526-4dde-9385-dc7de6860d56 experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:27:49] forcibly terminating trial experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<error> [2021-05-06, 11:27:49] unexpected failure of trial after restart 2/5: container failed with non-zero exit code: (exit code 1) experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:27:49] resetting trial 4 experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:27:49] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365 id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:27:49] requesting to delete kubernetes resources id="pod-4db3595f-1526-4dde-9385-dc7de6860d56" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pod"
<info> [2021-05-06, 11:27:49] de-registering pod handler handler="/pods/pod-4db3595f-1526-4dde-9385-dc7de6860d56" id="pods" pod="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pods"
<info> [2021-05-06, 11:27:49] deleted pod exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe handler="/pods/pod-4db3595f-1526-4dde-9385-dc7de6860d56" id="kubernetes-worker-3" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:27:49] resources are requested by /experiments/4/b9b4e788-0d16-463d-988e-732645112365 (Task ID: f672e95a-a74b-44c7-991f-2c83fe90e225) id="kubernetesRM" system="master" type="kubernetesResourceManager"
<warning> [2021-05-06, 11:27:49] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pods"
<info> [2021-05-06, 11:27:49] deleted configMap exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe handler="/pods/pod-4db3595f-1526-4dde-9385-dc7de6860d56" id="kubernetes-worker-3" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:27:50] resources assigned with 1 pods id="kubernetesRM" system="master" task-handler="/experiments/4/b9b4e788-0d16-463d-988e-732645112365" task-id="f672e95a-a74b-44c7-991f-2c83fe90e225" type="kubernetesResourceManager"
<info> [2021-05-06, 11:27:50] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (4,4,1)> experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:27:50] registering pod handler handler="/pods/pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" id="pods" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pods"
<info> [2021-05-06, 11:27:50] created configMap exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie handler="/pods/pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" id="kubernetes-worker-4" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:27:50] created pod exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie handler="/pods/pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" id="kubernetes-worker-4" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:27:50] transitioning pod state from ASSIGNED to PULLING id="pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pod"
<info> [2021-05-06, 11:27:50] transitioning pod state from PULLING to STARTING id="pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pod"
<info> [2021-05-06, 11:27:56] transitioning pod state from STARTING to RUNNING id="pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pod"
<info> [2021-05-06, 11:27:56] found container running: 45593757-aeb6-495b-ae10-b8cfa617ca1c (rank 0) experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:27:56] pushing rendezvous information experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:27:56] found not all containers are connected experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<warning> [2021-05-06, 11:27:56] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pods"
<info> [2021-05-06, 11:28:00] new connection from container 45593757-aeb6-495b-ae10-b8cfa617ca1c trial 4 (experiment 4) at 100.82.21.9:38376
<info> [2021-05-06, 11:28:00] pushing rendezvous information experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:00] found all containers are connected successfully experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<warning> [2021-05-06, 11:28:02] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pods"
<warning> [2021-05-06, 11:28:02] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-d6bfa1db-15fe-4f01-b699-41f60c327e33-national-doe" system="master" type="pods"
<info> [2021-05-06, 11:28:19] transitioning pod state from RUNNING to TERMINATED id="pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pod"
<info> [2021-05-06, 11:28:19] pod failed with exit code: 1 id="pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pod"
<info> [2021-05-06, 11:28:19] found container terminated: 45593757-aeb6-495b-ae10-b8cfa617ca1c experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:19] forcibly terminating trial experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<error> [2021-05-06, 11:28:19] unexpected failure of trial after restart 3/5: container failed with non-zero exit code: (exit code 1) experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:19] resetting trial 4 experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:19] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365 id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:28:19] requesting to delete kubernetes resources id="pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pod"
<info> [2021-05-06, 11:28:19] de-registering pod handler handler="/pods/pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" id="pods" pod="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pods"
<info> [2021-05-06, 11:28:19] deleted pod exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie handler="/pods/pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" id="kubernetes-worker-0" system="master" type="requestProcessingWorker"
<warning> [2021-05-06, 11:28:19] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pods"
<info> [2021-05-06, 11:28:19] deleted configMap exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie handler="/pods/pod-45593757-aeb6-495b-ae10-b8cfa617ca1c" id="kubernetes-worker-0" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:28:19] resources are requested by /experiments/4/b9b4e788-0d16-463d-988e-732645112365 (Task ID: c824a795-d349-4383-82df-599f3737820c) id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:28:20] resources assigned with 1 pods id="kubernetesRM" system="master" task-handler="/experiments/4/b9b4e788-0d16-463d-988e-732645112365" task-id="c824a795-d349-4383-82df-599f3737820c" type="kubernetesResourceManager"
<info> [2021-05-06, 11:28:20] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (4,4,1)> experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:20] registering pod handler handler="/pods/pod-c1b0d268-193b-4591-97bb-a2b44befa968" id="pods" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pods"
<info> [2021-05-06, 11:28:20] created configMap exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan handler="/pods/pod-c1b0d268-193b-4591-97bb-a2b44befa968" id="kubernetes-worker-1" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:28:20] created pod exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan handler="/pods/pod-c1b0d268-193b-4591-97bb-a2b44befa968" id="kubernetes-worker-1" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:28:20] transitioning pod state from ASSIGNED to PULLING id="pod-c1b0d268-193b-4591-97bb-a2b44befa968" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pod"
<info> [2021-05-06, 11:28:20] transitioning pod state from PULLING to STARTING id="pod-c1b0d268-193b-4591-97bb-a2b44befa968" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pod"
<info> [2021-05-06, 11:28:25] transitioning pod state from STARTING to RUNNING id="pod-c1b0d268-193b-4591-97bb-a2b44befa968" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pod"
<info> [2021-05-06, 11:28:25] found container running: c1b0d268-193b-4591-97bb-a2b44befa968 (rank 0) experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:25] pushing rendezvous information experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:25] found not all containers are connected experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<warning> [2021-05-06, 11:28:26] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pods"
<info> [2021-05-06, 11:28:29] new connection from container c1b0d268-193b-4591-97bb-a2b44befa968 trial 4 (experiment 4) at 100.82.21.10:47188
<info> [2021-05-06, 11:28:29] pushing rendezvous information experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:29] found all containers are connected successfully experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<warning> [2021-05-06, 11:28:32] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pods"
<warning> [2021-05-06, 11:28:32] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-f672e95a-a74b-44c7-991f-2c83fe90e225-precious-magpie" system="master" type="pods"
<info> [2021-05-06, 11:28:41] transitioning pod state from RUNNING to TERMINATED id="pod-c1b0d268-193b-4591-97bb-a2b44befa968" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pod"
<info> [2021-05-06, 11:28:41] pod failed with exit code: 1 id="pod-c1b0d268-193b-4591-97bb-a2b44befa968" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pod"
<info> [2021-05-06, 11:28:41] found container terminated: c1b0d268-193b-4591-97bb-a2b44befa968 experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:41] forcibly terminating trial experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:41] requesting to delete kubernetes resources id="pod-c1b0d268-193b-4591-97bb-a2b44befa968" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pod"
<info> [2021-05-06, 11:28:41] de-registering pod handler handler="/pods/pod-c1b0d268-193b-4591-97bb-a2b44befa968" id="pods" pod="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pods"
<info> [2021-05-06, 11:28:41] deleted pod exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan handler="/pods/pod-c1b0d268-193b-4591-97bb-a2b44befa968" id="kubernetes-worker-2" system="master" type="requestProcessingWorker"
<warning> [2021-05-06, 11:28:41] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pods"
<info> [2021-05-06, 11:28:41] deleted configMap exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan handler="/pods/pod-c1b0d268-193b-4591-97bb-a2b44befa968" id="kubernetes-worker-2" system="master" type="requestProcessingWorker"
<error> [2021-05-06, 11:28:41] unexpected failure of trial after restart 4/5: container failed with non-zero exit code: (exit code 1) experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:41] resetting trial 4 experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:41] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365 id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:28:41] resources are requested by /experiments/4/b9b4e788-0d16-463d-988e-732645112365 (Task ID: ad26f119-f705-4544-be46-06e7efd80a78) id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:28:42] resources assigned with 1 pods id="kubernetesRM" system="master" task-handler="/experiments/4/b9b4e788-0d16-463d-988e-732645112365" task-id="ad26f119-f705-4544-be46-06e7efd80a78" type="kubernetesResourceManager"
<info> [2021-05-06, 11:28:42] starting trial container: <RUN_STEP (100 Batches) (0 Prior Batches): (4,4,1)> experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:42] registering pod handler handler="/pods/pod-98e667f9-a2e1-46a5-87b1-42786cd67805" id="pods" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pods"
<info> [2021-05-06, 11:28:42] created configMap exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien handler="/pods/pod-98e667f9-a2e1-46a5-87b1-42786cd67805" id="kubernetes-worker-3" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:28:42] created pod exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien handler="/pods/pod-98e667f9-a2e1-46a5-87b1-42786cd67805" id="kubernetes-worker-3" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:28:42] transitioning pod state from ASSIGNED to PULLING id="pod-98e667f9-a2e1-46a5-87b1-42786cd67805" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pod"
<info> [2021-05-06, 11:28:42] transitioning pod state from PULLING to STARTING id="pod-98e667f9-a2e1-46a5-87b1-42786cd67805" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pod"
<warning> [2021-05-06, 11:28:48] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pods"
<info> [2021-05-06, 11:28:48] transitioning pod state from STARTING to RUNNING id="pod-98e667f9-a2e1-46a5-87b1-42786cd67805" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pod"
<info> [2021-05-06, 11:28:48] found container running: 98e667f9-a2e1-46a5-87b1-42786cd67805 (rank 0) experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:48] pushing rendezvous information experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:48] found not all containers are connected experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<warning> [2021-05-06, 11:28:49] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pods"
<warning> [2021-05-06, 11:28:49] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-c824a795-d349-4383-82df-599f3737820c-apt-swan" system="master" type="pods"
<info> [2021-05-06, 11:28:51] new connection from container 98e667f9-a2e1-46a5-87b1-42786cd67805 trial 4 (experiment 4) at 100.82.21.11:46114
<info> [2021-05-06, 11:28:51] pushing rendezvous information experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:28:51] found all containers are connected successfully experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:29:03] transitioning pod state from RUNNING to TERMINATED id="pod-98e667f9-a2e1-46a5-87b1-42786cd67805" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pod"
<info> [2021-05-06, 11:29:03] pod failed with exit code: 1 id="pod-98e667f9-a2e1-46a5-87b1-42786cd67805" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pod"
<info> [2021-05-06, 11:29:03] requesting to delete kubernetes resources id="pod-98e667f9-a2e1-46a5-87b1-42786cd67805" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pod"
<info> [2021-05-06, 11:29:03] found container terminated: 98e667f9-a2e1-46a5-87b1-42786cd67805 experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:29:03] forcibly terminating trial experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:29:03] de-registering pod handler handler="/pods/pod-98e667f9-a2e1-46a5-87b1-42786cd67805" id="pods" pod="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pods"
<info> [2021-05-06, 11:29:03] received stop pod command for unregistered container id id="pods" pod-id="98e667f9-a2e1-46a5-87b1-42786cd67805" system="master" type="pods"
<info> [2021-05-06, 11:29:03] deleted pod exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien handler="/pods/pod-98e667f9-a2e1-46a5-87b1-42786cd67805" id="kubernetes-worker-4" system="master" type="requestProcessingWorker"
<warning> [2021-05-06, 11:29:03] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pods"
<info> [2021-05-06, 11:29:03] deleted configMap exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien handler="/pods/pod-98e667f9-a2e1-46a5-87b1-42786cd67805" id="kubernetes-worker-4" system="master" type="requestProcessingWorker"
<error> [2021-05-06, 11:29:04] unexpected failure of trial after restart 5/5: container failed with non-zero exit code: (exit code 1) experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:29:04] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365 id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:29:04] exiting trial early from <RUN_STEP (100 Batches) (0 Prior Batches): (4,4,1)> with reason ERRORED experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<error> [2021-05-06, 11:29:04] error shutting down actor error="trial 4 failed and reached maximum number of restarts" experiment-id="4" id="b9b4e788-0d16-463d-988e-732645112365" system="master" trial-id="4" type="trial"
<info> [2021-05-06, 11:29:04] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365 id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:29:04] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365 id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:29:04] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365 id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:29:04] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365 id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:29:04] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365 id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:29:04] resources are released for /experiments/4/b9b4e788-0d16-463d-988e-732645112365 id="kubernetesRM" system="master" type="kubernetesResourceManager"
<error> [2021-05-06, 11:29:04] trial failed unexpectedly error="trial 4 failed and reached maximum number of restarts" id="4" system="master" type="experiment"
<info> [2021-05-06, 11:29:04] experiment state changed to STOPPING_ERROR id="4" system="master" type="experiment"
<info> [2021-05-06, 11:29:04] experiment state changed to ERROR id="4" system="master" type="experiment"
<info> [2021-05-06, 11:29:04] resources are requested by /experiment-4-checkpoint-gc (Task ID: 4d36779b-8de5-4a9a-9558-e6655935369f) id="kubernetesRM" system="master" type="kubernetesResourceManager"
<info> [2021-05-06, 11:29:04] experiment shut down successfully id="4" system="master" type="experiment"
<info> [2021-05-06, 11:29:04] resources assigned with 1 pods id="kubernetesRM" system="master" task-handler="/experiment-4-checkpoint-gc" task-id="4d36779b-8de5-4a9a-9558-e6655935369f" type="kubernetesResourceManager"
<info> [2021-05-06, 11:29:04] starting checkpoint garbage collection id="experiment-4-checkpoint-gc" system="master" type="checkpointGCTask"
<info> [2021-05-06, 11:29:04] registering pod handler handler="/pods/pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" id="pods" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pods"
<info> [2021-05-06, 11:29:04] created configMap gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant handler="/pods/pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" id="kubernetes-worker-0" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:29:04] created pod gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant handler="/pods/pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" id="kubernetes-worker-0" system="master" type="requestProcessingWorker"
<info> [2021-05-06, 11:29:04] transitioning pod state from ASSIGNED to PULLING id="pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pod"
<info> [2021-05-06, 11:29:04] transitioning pod state from PULLING to STARTING id="pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pod"
<info> [2021-05-06, 11:29:07] transitioning pod state from STARTING to RUNNING id="pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pod"
<info> [2021-05-06, 11:29:09] transitioning pod state from RUNNING to TERMINATED id="pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pod"
<info> [2021-05-06, 11:29:09] pod exited successfully id="pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pod"
<info> [2021-05-06, 11:29:09] finished checkpoint garbage collection id="experiment-4-checkpoint-gc" system="master" type="checkpointGCTask"
<info> [2021-05-06, 11:29:09] resources are released for /experiment-4-checkpoint-gc id="kubernetesRM" system="master" type="kubernetesResourceManager"
<warning> [2021-05-06, 11:29:10] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pods"
<info> [2021-05-06, 11:29:10] requesting to delete kubernetes resources id="pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pod"
<info> [2021-05-06, 11:29:10] de-registering pod handler handler="/pods/pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" id="pods" pod="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pods"
<info> [2021-05-06, 11:29:10] deleted pod gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant handler="/pods/pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" id="kubernetes-worker-1" system="master" type="requestProcessingWorker"
<warning> [2021-05-06, 11:29:10] received pod status update for un-registered pod id="pods" pod-name="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pods"
<warning> [2021-05-06, 11:29:10] received pod status update for un-registered pod id="pods" pod-name="gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant" system="master" type="pods"
<info> [2021-05-06, 11:29:10] deleted configMap gc-4d36779b-8de5-4a9a-9558-e6655935369f-natural-ant handler="/pods/pod-1bb86bec-bf32-4c15-bc94-cf58ec69f811" id="kubernetes-worker-1" system="master" type="requestProcessingWorker"
<warning> [2021-05-06, 11:29:11] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pods"
<warning> [2021-05-06, 11:29:11] received pod status update for un-registered pod id="pods" pod-name="exp-4-trial-4-rank-0-ad26f119-f705-4544-be46-06e7efd80a78-enhanced-alien" system="master" type="pods"
Trial logs from an earlier run of the same MNIST example (experiment 1, same failure):

[2021-05-06T03:34:00Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Pod resources allocated.
[2021-05-06T03:34:00Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:34:01Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Created container determined-init-container
[2021-05-06T03:34:01Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Started container determined-init-container
[2021-05-06T03:34:01Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-06T03:34:01Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Created container determined-fluent-container
[2021-05-06T03:34:01Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Started container determined-fluent-container
[2021-05-06T03:34:01Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:34:03Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Created container determined-container
[2021-05-06T03:34:03Z] ad2df585 || INFO: Pod exp-1-trial-1-rank-0-b88c8b58-dedb-4b06-8c61-af8eb09a4458-pumped-primate: Started container determined-container
[2021-05-06T03:34:04Z] ad2df585 || + WORKING_DIR=/run/determined/workdir
[2021-05-06T03:34:04Z] ad2df585 || + STARTUP_HOOK=startup-hook.sh
[2021-05-06T03:34:04Z] ad2df585 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:34:04Z] ad2df585 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:34:04Z] ad2df585 || + '[' -z '' ']'
[2021-05-06T03:34:04Z] ad2df585 || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:34:04Z] ad2df585 || + DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:34:04Z] ad2df585 || + /bin/which python3
[2021-05-06T03:34:04Z] ad2df585 || + '[' /root = / ']'
[2021-05-06T03:34:04Z] ad2df585 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.2-py3-none-any.whl
[2021-05-06T03:34:07Z] ad2df585 || ERROR: determined 0.15.2 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-05-06T03:34:07Z] ad2df585 || + cd /run/determined/workdir
[2021-05-06T03:34:07Z] ad2df585 || + test -f startup-hook.sh
[2021-05-06T03:34:07Z] ad2df585 || + exec python3 -m determined.exec.harness
[2021-05-06T03:34:08Z] ad2df585 || INFO: New trial runner in (container ad2df585-0619-404d-92e2-465adc5d00eb) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'ad2df585-0619-404d-92e2-465adc5d00eb', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620272038}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-e5db4f9e-516e-9153-4a40-b0be71028ab5'], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1001305543, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-05-06T03:34:08Z] ad2df585 || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/1/1/ad2df585-0619-404d-92e2-465adc5d00eb
[2021-05-06T03:34:08Z] ad2df585 || INFO: Connected to master
[2021-05-06T03:34:08Z] ad2df585 || INFO: Established WebSocket session with master
[2021-05-06T03:34:08Z] ad2df585 || INFO: Got rendezvous information: {'addrs': ['100.122.176.247:1734'], 'addrs2': ['100.122.176.247:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.247', 'container_port': 1734, 'host_ip': '100.122.176.247', 'host_port': 1734}, {'container_ip': '100.122.176.247', 'container_port': 1750, 'host_ip': '100.122.176.247', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-06T03:34:09Z] ad2df585 || INFO: Horovod config: {'use': False, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-06T03:34:09Z] ad2df585 || INFO: Loading Trial implementation with entrypoint model_def:MNistTrial.
[2021-05-06T03:34:10Z] ad2df585 || /opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning:
[2021-05-06T03:34:10Z] ad2df585 || GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
[2021-05-06T03:34:10Z] ad2df585 || The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
[2021-05-06T03:34:10Z] ad2df585 || If you want to use the GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
[2021-05-06T03:34:10Z] ad2df585 || warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[2021-05-06T03:34:12Z] ad2df585 || INFO: Creating PyTorchTrialController with MNistTrial.
[2021-05-06T03:34:12Z] ad2df585 || INFO: Downloading https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz to /tmp/data-rank0/MNIST/pytorch_mnist.tar.gz
[2021-05-06T03:34:18Z] ad2df585 || INFO: Running workload <RUN_STEP (100 Batches): (1,1,1)>
[2021-05-06T03:34:18Z] ad2df585 || INFO: WebSocket closed
[2021-05-06T03:34:18Z] ad2df585 || INFO: Disconnected from master, exiting gracefully
[2021-05-06T03:34:18Z] ad2df585 || Traceback (most recent call last):
[2021-05-06T03:34:18Z] ad2df585 || File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-05-06T03:34:18Z] ad2df585 || "__main__", mod_spec)
[2021-05-06T03:34:18Z] ad2df585 || File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-05-06T03:34:18Z] ad2df585 || exec(code, run_globals)
[2021-05-06T03:34:18Z] ad2df585 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-06T03:34:18Z] ad2df585 || main()
[2021-05-06T03:34:18Z] ad2df585 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-06T03:34:18Z] ad2df585 || build_and_run_training_pipeline(env)
[2021-05-06T03:34:18Z] ad2df585 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 149, in build_and_run_training_pipeline
[2021-05-06T03:34:18Z] ad2df585 || controller.run()
[2021-05-06T03:34:18Z] ad2df585 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 152, in run
[2021-05-06T03:34:18Z] ad2df585 || w.total_batches_processed,
[2021-05-06T03:34:18Z] ad2df585 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 310, in _train_for_step
[2021-05-06T03:34:18Z] ad2df585 || batch_idx=batch_idx,
[2021-05-06T03:34:18Z] ad2df585 || File "/run/determined/workdir/model_def.py", line 84, in train_batch
[2021-05-06T03:34:18Z] ad2df585 || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:34:18Z] ad2df585 || output = self.model(data)
[2021-05-06T03:34:18Z] ad2df585 || result = self.forward(*input, **kwargs)
[2021-05-06T03:34:18Z] ad2df585 || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
[2021-05-06T03:34:18Z] ad2df585 || input = module(input)
[2021-05-06T03:34:18Z] ad2df585 || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:34:18Z] ad2df585 || result = self.forward(*input, **kwargs)
[2021-05-06T03:34:18Z] ad2df585 || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
[2021-05-06T03:34:18Z] ad2df585 || return self._conv_forward(input, self.weight)
[2021-05-06T03:34:18Z] ad2df585 || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
[2021-05-06T03:34:18Z] ad2df585 || self.padding, self.dilation, self.groups)
[2021-05-06T03:34:18Z] ad2df585 || RuntimeError: CUDA error: no kernel image is available for execution on the device
[2021-05-06T03:34:20Z] ad2df585 || INFO: container failed with non-zero exit code: (exit code 1)
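
Note: every retry below fails in exactly the same way, and the UserWarning above already points at the root cause: the cuda-10.2 task image ships PyTorch binaries compiled for sm_37 through sm_75, while the GeForce RTX 3090 is compute capability sm_86, so the very first conv2d kernel launch dies with "no kernel image is available for execution on the device". Here is a minimal check (my addition, not part of the logs; run it inside the same task container, e.g. from a Determined shell or notebook) that makes the mismatch explicit:

import torch

# Build baked into determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40
print(torch.__version__, torch.version.cuda)         # expect 1.7.x built against CUDA 10.2

# What the driver reports for the physical GPU
major, minor = torch.cuda.get_device_capability(0)
print(torch.cuda.get_device_name(0), "sm_%d%d" % (major, minor))  # GeForce RTX 3090, sm_86

# Architectures this PyTorch build actually contains kernels for
print(torch.cuda.get_arch_list())                    # sm_86 missing -> any kernel launch fails

If sm_86 is absent from get_arch_list(), any CUDA op (such as the conv2d reached from train_batch at model_def.py line 84) reproduces the RuntimeError seen in every trace below.
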
[2021-05-06T03:34:21Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Pod resources allocated.
[2021-05-06T03:34:22Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:34:22Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Created container determined-init-container
[2021-05-06T03:34:23Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Started container determined-init-container
[2021-05-06T03:34:23Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-06T03:34:23Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Created container determined-fluent-container
[2021-05-06T03:34:23Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Started container determined-fluent-container
[2021-05-06T03:34:23Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:34:25Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Created container determined-container
[2021-05-06T03:34:26Z] 4568b05f || INFO: Pod exp-1-trial-1-rank-0-4f28425a-7a10-40a5-a9d8-393134c7e607-mint-egret: Started container determined-container
[2021-05-06T03:34:29Z] 4568b05f || + WORKING_DIR=/run/determined/workdir
[2021-05-06T03:34:29Z] 4568b05f || + STARTUP_HOOK=startup-hook.sh
[2021-05-06T03:34:29Z] 4568b05f || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:34:29Z] 4568b05f || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:34:29Z] 4568b05f || + '[' -z '' ']'
[2021-05-06T03:34:29Z] 4568b05f || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:34:29Z] 4568b05f || + DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:34:29Z] 4568b05f || + /bin/which python3
[2021-05-06T03:34:29Z] 4568b05f || + '[' /root = / ']'
[2021-05-06T03:34:29Z] 4568b05f || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.2-py3-none-any.whl
[2021-05-06T03:34:29Z] 4568b05f || ERROR: determined 0.15.2 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-05-06T03:34:29Z] 4568b05f || + cd /run/determined/workdir
[2021-05-06T03:34:29Z] 4568b05f || + test -f startup-hook.sh
[2021-05-06T03:34:29Z] 4568b05f || + exec python3 -m determined.exec.harness
[2021-05-06T03:34:29Z] 4568b05f || INFO: New trial runner in (container 4568b05f-8693-4379-9856-eaa3768b5d11) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '4568b05f-8693-4379-9856-eaa3768b5d11', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620272038}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-a523e020-b504-535c-2b83-967ab28cdbab'], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1001305543, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-05-06T03:34:29Z] 4568b05f || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/1/1/4568b05f-8693-4379-9856-eaa3768b5d11
[2021-05-06T03:34:29Z] 4568b05f || INFO: Connected to master
[2021-05-06T03:34:29Z] 4568b05f || INFO: Established WebSocket session with master
[2021-05-06T03:34:29Z] 4568b05f || INFO: Got rendezvous information: {'addrs': ['100.122.176.248:1734'], 'addrs2': ['100.122.176.248:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.248', 'container_port': 1734, 'host_ip': '100.122.176.248', 'host_port': 1734}, {'container_ip': '100.122.176.248', 'container_port': 1750, 'host_ip': '100.122.176.248', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-06T03:34:31Z] 4568b05f || INFO: Horovod config: {'use': False, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-06T03:34:31Z] 4568b05f || INFO: Loading Trial implementation with entrypoint model_def:MNistTrial.
[2021-05-06T03:34:31Z] 4568b05f || /opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning:
[2021-05-06T03:34:31Z] 4568b05f || GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
[2021-05-06T03:34:31Z] 4568b05f || The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
[2021-05-06T03:34:31Z] 4568b05f || If you want to use the GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
[2021-05-06T03:34:31Z] 4568b05f || warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[2021-05-06T03:34:33Z] 4568b05f || INFO: Creating PyTorchTrialController with MNistTrial.
[2021-05-06T03:34:33Z] 4568b05f || INFO: Downloading https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz to /tmp/data-rank0/MNIST/pytorch_mnist.tar.gz
[2021-05-06T03:34:39Z] 4568b05f || INFO: Running workload <RUN_STEP (100 Batches): (1,1,1)>
[2021-05-06T03:34:39Z] 4568b05f || INFO: WebSocket closed
[2021-05-06T03:34:39Z] 4568b05f || INFO: Disconnected from master, exiting gracefully
[2021-05-06T03:34:39Z] 4568b05f || Traceback (most recent call last):
[2021-05-06T03:34:39Z] 4568b05f || File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-05-06T03:34:39Z] 4568b05f || "__main__", mod_spec)
[2021-05-06T03:34:39Z] 4568b05f || File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-05-06T03:34:39Z] 4568b05f || exec(code, run_globals)
[2021-05-06T03:34:39Z] 4568b05f || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-06T03:34:39Z] 4568b05f || main()
[2021-05-06T03:34:39Z] 4568b05f || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-06T03:34:39Z] 4568b05f || build_and_run_training_pipeline(env)
[2021-05-06T03:34:39Z] 4568b05f || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 149, in build_and_run_training_pipeline
[2021-05-06T03:34:39Z] 4568b05f || controller.run()
[2021-05-06T03:34:39Z] 4568b05f || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 152, in run
[2021-05-06T03:34:39Z] 4568b05f || w.total_batches_processed,
[2021-05-06T03:34:39Z] 4568b05f || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 310, in _train_for_step
[2021-05-06T03:34:39Z] 4568b05f || batch_idx=batch_idx,
[2021-05-06T03:34:39Z] 4568b05f || File "/run/determined/workdir/model_def.py", line 84, in train_batch
[2021-05-06T03:34:39Z] 4568b05f || output = self.model(data)
[2021-05-06T03:34:39Z] 4568b05f || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:34:39Z] 4568b05f || result = self.forward(*input, **kwargs)
[2021-05-06T03:34:39Z] 4568b05f || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
[2021-05-06T03:34:39Z] 4568b05f || input = module(input)
[2021-05-06T03:34:39Z] 4568b05f || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:34:39Z] 4568b05f || result = self.forward(*input, **kwargs)
[2021-05-06T03:34:39Z] 4568b05f || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
[2021-05-06T03:34:39Z] 4568b05f || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
[2021-05-06T03:34:39Z] 4568b05f || return self._conv_forward(input, self.weight)
[2021-05-06T03:34:39Z] 4568b05f || self.padding, self.dilation, self.groups)
[2021-05-06T03:34:39Z] 4568b05f || RuntimeError: CUDA error: no kernel image is available for execution on the device
[2021-05-06T03:34:40Z] 4568b05f || INFO: container failed with non-zero exit code: (exit code 1)
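
Aside: the pip "ERROR: determined 0.15.2 has requirement ruamel.yaml>=0.15.78 ..." line shows up in every restart, but the startup script still reaches "exec python3 -m determined.exec.harness", so it is a dependency-resolver complaint rather than the fatal error here. A quick sanity check (again my addition, not from the logs) to see which version is actually importable in the container:

import ruamel.yaml

# pip's warning says the image carries ruamel-yaml 0.15.46 while the
# determined wheel declares ruamel.yaml>=0.15.78; confirm what Python sees.
print(getattr(ruamel.yaml, "__version__", "unknown"))
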
[2021-05-06T03:34:41Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Pod resources allocated.
[2021-05-06T03:34:42Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:34:42Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Created container determined-init-container
[2021-05-06T03:34:42Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Started container determined-init-container
[2021-05-06T03:34:43Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-06T03:34:43Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Created container determined-fluent-container
[2021-05-06T03:34:44Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Started container determined-fluent-container
[2021-05-06T03:34:44Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:34:45Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Created container determined-container
[2021-05-06T03:34:46Z] 7047ec73 || INFO: Pod exp-1-trial-1-rank-0-25b5c65e-2321-464c-ab54-aac345a9dc60-positive-monkey: Started container determined-container
[2021-05-06T03:34:49Z] 7047ec73 || + WORKING_DIR=/run/determined/workdir
[2021-05-06T03:34:49Z] 7047ec73 || + STARTUP_HOOK=startup-hook.sh
[2021-05-06T03:34:49Z] 7047ec73 || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:34:49Z] 7047ec73 || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:34:49Z] 7047ec73 || + '[' -z '' ']'
[2021-05-06T03:34:49Z] 7047ec73 || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:34:49Z] 7047ec73 || + DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:34:49Z] 7047ec73 || + /bin/which python3
[2021-05-06T03:34:49Z] 7047ec73 || + '[' /root = / ']'
[2021-05-06T03:34:49Z] 7047ec73 || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.2-py3-none-any.whl
[2021-05-06T03:34:50Z] 7047ec73 || ERROR: determined 0.15.2 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-05-06T03:34:50Z] 7047ec73 || + cd /run/determined/workdir
[2021-05-06T03:34:50Z] 7047ec73 || + test -f startup-hook.sh
[2021-05-06T03:34:50Z] 7047ec73 || + exec python3 -m determined.exec.harness
[2021-05-06T03:34:51Z] 7047ec73 || INFO: New trial runner in (container 7047ec73-a6b0-4193-94b8-97dd94eca9b9) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': '7047ec73-a6b0-4193-94b8-97dd94eca9b9', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620272038}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-8e73c217-85aa-c8c5-2551-52c59279e009'], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1001305543, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-05-06T03:34:51Z] 7047ec73 || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/1/1/7047ec73-a6b0-4193-94b8-97dd94eca9b9
[2021-05-06T03:34:51Z] 7047ec73 || INFO: Connected to master
[2021-05-06T03:34:51Z] 7047ec73 || INFO: Established WebSocket session with master
[2021-05-06T03:34:51Z] 7047ec73 || INFO: Got rendezvous information: {'addrs': ['100.122.176.249:1734'], 'addrs2': ['100.122.176.249:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.249', 'container_port': 1734, 'host_ip': '100.122.176.249', 'host_port': 1734}, {'container_ip': '100.122.176.249', 'container_port': 1750, 'host_ip': '100.122.176.249', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-06T03:34:52Z] 7047ec73 || INFO: Horovod config: {'use': False, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-06T03:34:52Z] 7047ec73 || INFO: Loading Trial implementation with entrypoint model_def:MNistTrial.
[2021-05-06T03:34:53Z] 7047ec73 || /opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning:
[2021-05-06T03:34:53Z] 7047ec73 || GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
[2021-05-06T03:34:53Z] 7047ec73 || The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
[2021-05-06T03:34:53Z] 7047ec73 || If you want to use the GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
[2021-05-06T03:34:53Z] 7047ec73 || warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[2021-05-06T03:34:55Z] 7047ec73 || INFO: Creating PyTorchTrialController with MNistTrial.
[2021-05-06T03:34:55Z] 7047ec73 || INFO: Downloading https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz to /tmp/data-rank0/MNIST/pytorch_mnist.tar.gz
[2021-05-06T03:35:02Z] 7047ec73 || INFO: Running workload <RUN_STEP (100 Batches): (1,1,1)>
[2021-05-06T03:35:02Z] 7047ec73 || INFO: WebSocket closed
[2021-05-06T03:35:02Z] 7047ec73 || INFO: Disconnected from master, exiting gracefully
[2021-05-06T03:35:02Z] 7047ec73 || Traceback (most recent call last):
[2021-05-06T03:35:02Z] 7047ec73 || File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-05-06T03:35:02Z] 7047ec73 || "__main__", mod_spec)
[2021-05-06T03:35:02Z] 7047ec73 || File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-05-06T03:35:02Z] 7047ec73 || exec(code, run_globals)
[2021-05-06T03:35:02Z] 7047ec73 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-06T03:35:02Z] 7047ec73 || main()
[2021-05-06T03:35:02Z] 7047ec73 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-06T03:35:02Z] 7047ec73 || build_and_run_training_pipeline(env)
[2021-05-06T03:35:02Z] 7047ec73 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 149, in build_and_run_training_pipeline
[2021-05-06T03:35:02Z] 7047ec73 || controller.run()
[2021-05-06T03:35:02Z] 7047ec73 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 152, in run
[2021-05-06T03:35:02Z] 7047ec73 || w.total_batches_processed,
[2021-05-06T03:35:02Z] 7047ec73 || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 310, in _train_for_step
[2021-05-06T03:35:02Z] 7047ec73 || batch_idx=batch_idx,
[2021-05-06T03:35:02Z] 7047ec73 || File "/run/determined/workdir/model_def.py", line 84, in train_batch
[2021-05-06T03:35:02Z] 7047ec73 || output = self.model(data)
[2021-05-06T03:35:02Z] 7047ec73 || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:35:02Z] 7047ec73 || result = self.forward(*input, **kwargs)
[2021-05-06T03:35:02Z] 7047ec73 || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
[2021-05-06T03:35:02Z] 7047ec73 || input = module(input)
[2021-05-06T03:35:02Z] 7047ec73 || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:35:02Z] 7047ec73 || result = self.forward(*input, **kwargs)
[2021-05-06T03:35:02Z] 7047ec73 || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
[2021-05-06T03:35:02Z] 7047ec73 || return self._conv_forward(input, self.weight)
[2021-05-06T03:35:02Z] 7047ec73 || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
[2021-05-06T03:35:02Z] 7047ec73 || self.padding, self.dilation, self.groups)
[2021-05-06T03:35:02Z] 7047ec73 || RuntimeError: CUDA error: no kernel image is available for execution on the device
[2021-05-06T03:35:03Z] 7047ec73 || INFO: container failed with non-zero exit code: (exit code 1)
[2021-05-06T03:35:04Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Pod resources allocated.
[2021-05-06T03:35:05Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:35:05Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Created container determined-init-container
[2021-05-06T03:35:05Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Started container determined-init-container
[2021-05-06T03:35:06Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-06T03:35:06Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Created container determined-fluent-container
[2021-05-06T03:35:07Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Started container determined-fluent-container
[2021-05-06T03:35:07Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:35:08Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Created container determined-container
[2021-05-06T03:35:09Z] fd7316ae || INFO: Pod exp-1-trial-1-rank-0-6d05a1bc-de12-4308-8f59-2002c5c6c136-together-piranha: Started container determined-container
[2021-05-06T03:35:12Z] fd7316ae || + WORKING_DIR=/run/determined/workdir
[2021-05-06T03:35:12Z] fd7316ae || + STARTUP_HOOK=startup-hook.sh
[2021-05-06T03:35:12Z] fd7316ae || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:35:12Z] fd7316ae || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:35:12Z] fd7316ae || + '[' -z '' ']'
[2021-05-06T03:35:12Z] fd7316ae || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:35:12Z] fd7316ae || + DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:35:12Z] fd7316ae || + /bin/which python3
[2021-05-06T03:35:12Z] fd7316ae || + '[' /root = / ']'
[2021-05-06T03:35:12Z] fd7316ae || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.2-py3-none-any.whl
[2021-05-06T03:35:12Z] fd7316ae || ERROR: determined 0.15.2 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-05-06T03:35:12Z] fd7316ae || + cd /run/determined/workdir
[2021-05-06T03:35:12Z] fd7316ae || + test -f startup-hook.sh
[2021-05-06T03:35:12Z] fd7316ae || + exec python3 -m determined.exec.harness
[2021-05-06T03:35:13Z] fd7316ae || INFO: New trial runner in (container fd7316ae-4cd1-4324-ab6c-a060b0cea14d) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'fd7316ae-4cd1-4324-ab6c-a060b0cea14d', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620272038}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-2bfb4f26-19eb-b65f-08b1-1de7d47801e2'], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1001305543, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-05-06T03:35:13Z] fd7316ae || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/1/1/fd7316ae-4cd1-4324-ab6c-a060b0cea14d
[2021-05-06T03:35:13Z] fd7316ae || INFO: Connected to master
[2021-05-06T03:35:13Z] fd7316ae || INFO: Established WebSocket session with master
[2021-05-06T03:35:13Z] fd7316ae || INFO: Got rendezvous information: {'addrs': ['100.122.176.250:1734'], 'addrs2': ['100.122.176.250:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.250', 'container_port': 1734, 'host_ip': '100.122.176.250', 'host_port': 1734}, {'container_ip': '100.122.176.250', 'container_port': 1750, 'host_ip': '100.122.176.250', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-06T03:35:14Z] fd7316ae || INFO: Horovod config: {'use': False, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-06T03:35:14Z] fd7316ae || INFO: Loading Trial implementation with entrypoint model_def:MNistTrial.
[2021-05-06T03:35:15Z] fd7316ae || /opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning:
[2021-05-06T03:35:15Z] fd7316ae || GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
[2021-05-06T03:35:15Z] fd7316ae || The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
[2021-05-06T03:35:15Z] fd7316ae || If you want to use the GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
[2021-05-06T03:35:15Z] fd7316ae || warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[2021-05-06T03:35:16Z] fd7316ae || INFO: Creating PyTorchTrialController with MNistTrial.
[2021-05-06T03:35:16Z] fd7316ae || INFO: Downloading https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz to /tmp/data-rank0/MNIST/pytorch_mnist.tar.gz
[2021-05-06T03:35:29Z] fd7316ae || INFO: Running workload <RUN_STEP (100 Batches): (1,1,1)>
[2021-05-06T03:35:29Z] fd7316ae || INFO: WebSocket closed
[2021-05-06T03:35:29Z] fd7316ae || INFO: Disconnected from master, exiting gracefully
[2021-05-06T03:35:29Z] fd7316ae || Traceback (most recent call last):
[2021-05-06T03:35:29Z] fd7316ae || File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-05-06T03:35:29Z] fd7316ae || "__main__", mod_spec)
[2021-05-06T03:35:29Z] fd7316ae || File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-05-06T03:35:29Z] fd7316ae || exec(code, run_globals)
[2021-05-06T03:35:29Z] fd7316ae || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-06T03:35:29Z] fd7316ae || main()
[2021-05-06T03:35:29Z] fd7316ae || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-06T03:35:29Z] fd7316ae || build_and_run_training_pipeline(env)
[2021-05-06T03:35:29Z] fd7316ae || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 149, in build_and_run_training_pipeline
[2021-05-06T03:35:29Z] fd7316ae || controller.run()
[2021-05-06T03:35:29Z] fd7316ae || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 152, in run
[2021-05-06T03:35:29Z] fd7316ae || w.total_batches_processed,
[2021-05-06T03:35:29Z] fd7316ae || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 310, in _train_for_step
[2021-05-06T03:35:29Z] fd7316ae || batch_idx=batch_idx,
[2021-05-06T03:35:29Z] fd7316ae || File "/run/determined/workdir/model_def.py", line 84, in train_batch
[2021-05-06T03:35:29Z] fd7316ae || output = self.model(data)
[2021-05-06T03:35:29Z] fd7316ae || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:35:29Z] fd7316ae || result = self.forward(*input, **kwargs)
[2021-05-06T03:35:29Z] fd7316ae || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
[2021-05-06T03:35:29Z] fd7316ae || input = module(input)
[2021-05-06T03:35:29Z] fd7316ae || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:35:29Z] fd7316ae || result = self.forward(*input, **kwargs)
[2021-05-06T03:35:29Z] fd7316ae || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
[2021-05-06T03:35:29Z] fd7316ae || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
[2021-05-06T03:35:29Z] fd7316ae || return self._conv_forward(input, self.weight)
[2021-05-06T03:35:29Z] fd7316ae || self.padding, self.dilation, self.groups)
[2021-05-06T03:35:29Z] fd7316ae || RuntimeError: CUDA error: no kernel image is available for execution on the device
[2021-05-06T03:35:30Z] fd7316ae || INFO: container failed with non-zero exit code: (exit code 1)
[2021-05-06T03:35:31Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Pod resources allocated.
[2021-05-06T03:35:32Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:35:32Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Created container determined-init-container
[2021-05-06T03:35:33Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Started container determined-init-container
[2021-05-06T03:35:34Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-06T03:35:34Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Created container determined-fluent-container
[2021-05-06T03:35:34Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Started container determined-fluent-container
[2021-05-06T03:35:34Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:35:36Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Created container determined-container
[2021-05-06T03:35:37Z] a58c2d6c || INFO: Pod exp-1-trial-1-rank-0-f485dd98-dac6-468b-8ed7-e9fb49c71441-social-monster: Started container determined-container
[2021-05-06T03:35:37Z] a58c2d6c || + WORKING_DIR=/run/determined/workdir
[2021-05-06T03:35:37Z] a58c2d6c || + STARTUP_HOOK=startup-hook.sh
[2021-05-06T03:35:37Z] a58c2d6c || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:35:37Z] a58c2d6c || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:35:37Z] a58c2d6c || + '[' -z '' ']'
[2021-05-06T03:35:37Z] a58c2d6c || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:35:37Z] a58c2d6c || + DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:35:37Z] a58c2d6c || + /bin/which python3
[2021-05-06T03:35:37Z] a58c2d6c || + '[' /root = / ']'
[2021-05-06T03:35:37Z] a58c2d6c || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.2-py3-none-any.whl
[2021-05-06T03:35:39Z] a58c2d6c || ERROR: determined 0.15.2 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-05-06T03:35:39Z] a58c2d6c || + cd /run/determined/workdir
[2021-05-06T03:35:39Z] a58c2d6c || + test -f startup-hook.sh
[2021-05-06T03:35:39Z] a58c2d6c || + exec python3 -m determined.exec.harness
[2021-05-06T03:35:40Z] a58c2d6c || INFO: New trial runner in (container a58c2d6c-1799-4eca-8af4-b5854c1ba447) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'a58c2d6c-1799-4eca-8af4-b5854c1ba447', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620272038}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-329704d2-485f-8e97-0c87-112c63f4201b'], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1001305543, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-05-06T03:35:40Z] a58c2d6c || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/1/1/a58c2d6c-1799-4eca-8af4-b5854c1ba447
[2021-05-06T03:35:40Z] a58c2d6c || INFO: Connected to master
[2021-05-06T03:35:40Z] a58c2d6c || INFO: Established WebSocket session with master
[2021-05-06T03:35:40Z] a58c2d6c || INFO: Got rendezvous information: {'addrs': ['100.122.176.251:1734'], 'addrs2': ['100.122.176.251:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.251', 'container_port': 1734, 'host_ip': '100.122.176.251', 'host_port': 1734}, {'container_ip': '100.122.176.251', 'container_port': 1750, 'host_ip': '100.122.176.251', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-06T03:35:41Z] a58c2d6c || INFO: Horovod config: {'use': False, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-06T03:35:41Z] a58c2d6c || INFO: Loading Trial implementation with entrypoint model_def:MNistTrial.
[2021-05-06T03:35:42Z] a58c2d6c || /opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning:
[2021-05-06T03:35:42Z] a58c2d6c || GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
[2021-05-06T03:35:42Z] a58c2d6c || The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
[2021-05-06T03:35:42Z] a58c2d6c || If you want to use the GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
[2021-05-06T03:35:42Z] a58c2d6c || warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[2021-05-06T03:35:44Z] a58c2d6c || INFO: Creating PyTorchTrialController with MNistTrial.
[2021-05-06T03:35:44Z] a58c2d6c || INFO: Downloading https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz to /tmp/data-rank0/MNIST/pytorch_mnist.tar.gz
[2021-05-06T03:35:49Z] a58c2d6c || INFO: Running workload <RUN_STEP (100 Batches): (1,1,1)>
[2021-05-06T03:35:49Z] a58c2d6c || INFO: WebSocket closed
[2021-05-06T03:35:49Z] a58c2d6c || INFO: Disconnected from master, exiting gracefully
[2021-05-06T03:35:49Z] a58c2d6c || Traceback (most recent call last):
[2021-05-06T03:35:49Z] a58c2d6c || File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-05-06T03:35:49Z] a58c2d6c || "__main__", mod_spec)
[2021-05-06T03:35:49Z] a58c2d6c || File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-05-06T03:35:49Z] a58c2d6c || exec(code, run_globals)
[2021-05-06T03:35:49Z] a58c2d6c || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-06T03:35:49Z] a58c2d6c || main()
[2021-05-06T03:35:49Z] a58c2d6c || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-06T03:35:49Z] a58c2d6c || build_and_run_training_pipeline(env)
[2021-05-06T03:35:49Z] a58c2d6c || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 149, in build_and_run_training_pipeline
[2021-05-06T03:35:49Z] a58c2d6c || controller.run()
[2021-05-06T03:35:49Z] a58c2d6c || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 152, in run
[2021-05-06T03:35:49Z] a58c2d6c || w.total_batches_processed,
[2021-05-06T03:35:49Z] a58c2d6c || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 310, in _train_for_step
[2021-05-06T03:35:49Z] a58c2d6c || batch_idx=batch_idx,
[2021-05-06T03:35:49Z] a58c2d6c || File "/run/determined/workdir/model_def.py", line 84, in train_batch
[2021-05-06T03:35:49Z] a58c2d6c || output = self.model(data)
[2021-05-06T03:35:49Z] a58c2d6c || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:35:49Z] a58c2d6c || result = self.forward(*input, **kwargs)
[2021-05-06T03:35:49Z] a58c2d6c || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
[2021-05-06T03:35:49Z] a58c2d6c || input = module(input)
[2021-05-06T03:35:49Z] a58c2d6c || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:35:49Z] a58c2d6c || result = self.forward(*input, **kwargs)
[2021-05-06T03:35:49Z] a58c2d6c || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
[2021-05-06T03:35:49Z] a58c2d6c || return self._conv_forward(input, self.weight)
[2021-05-06T03:35:49Z] a58c2d6c || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
[2021-05-06T03:35:49Z] a58c2d6c || self.padding, self.dilation, self.groups)
[2021-05-06T03:35:49Z] a58c2d6c || RuntimeError: CUDA error: no kernel image is available for execution on the device
[2021-05-06T03:35:51Z] a58c2d6c || INFO: container failed with non-zero exit code: (exit code 1)
[2021-05-06T03:35:52Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Pod resources allocated.
[2021-05-06T03:35:53Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:35:53Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Created container determined-init-container
[2021-05-06T03:35:53Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Started container determined-init-container
[2021-05-06T03:35:54Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Container image "fluent/fluent-bit:1.6" already present on machine
[2021-05-06T03:35:54Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Created container determined-fluent-container
[2021-05-06T03:35:55Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Started container determined-fluent-container
[2021-05-06T03:35:55Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Container image "determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40" already present on machine
[2021-05-06T03:35:57Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Created container determined-container
[2021-05-06T03:35:57Z] c5a9683c || INFO: Pod exp-1-trial-1-rank-0-a79a2198-9840-4bf8-b14c-cd8d5da6610e-epic-cowbird: Started container determined-container
[2021-05-06T03:36:00Z] c5a9683c || + WORKING_DIR=/run/determined/workdir
[2021-05-06T03:36:00Z] c5a9683c || + STARTUP_HOOK=startup-hook.sh
[2021-05-06T03:36:00Z] c5a9683c || + export PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:36:00Z] c5a9683c || + PATH=/run/determined/pythonuserbase/bin:/opt/conda/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
[2021-05-06T03:36:00Z] c5a9683c || + '[' -z '' ']'
[2021-05-06T03:36:00Z] c5a9683c || + export DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:36:00Z] c5a9683c || + DET_PYTHON_EXECUTABLE=python3
[2021-05-06T03:36:00Z] c5a9683c || + /bin/which python3
[2021-05-06T03:36:00Z] c5a9683c || + '[' /root = / ']'
[2021-05-06T03:36:00Z] c5a9683c || + python3 -m pip install -q --user /opt/determined/wheels/determined-0.15.2-py3-none-any.whl
[2021-05-06T03:36:00Z] c5a9683c || ERROR: determined 0.15.2 has requirement ruamel.yaml>=0.15.78, but you'll have ruamel-yaml 0.15.46 which is incompatible.
[2021-05-06T03:36:00Z] c5a9683c || + cd /run/determined/workdir
[2021-05-06T03:36:00Z] c5a9683c || + test -f startup-hook.sh
[2021-05-06T03:36:00Z] c5a9683c || + exec python3 -m determined.exec.harness
[2021-05-06T03:36:01Z] c5a9683c || INFO: New trial runner in (container c5a9683c-809a-4a1d-b28b-50782b5536c1) on agent k8agent: {'master_addr': '192.168.245.182', 'master_port': 8080, 'use_tls': 0, 'master_cert_file': None, 'master_cert_name': None, 'container_id': 'c5a9683c-809a-4a1d-b28b-50782b5536c1', 'experiment_config': {'description': 'mnist_pytorch_const', 'data': {'url': 'https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz'}, 'checkpoint_storage': {'host_path': '/checkpoints', 'save_experiment_best': 0, 'save_trial_best': 1, 'save_trial_latest': 1, 'type': 'shared_fs'}, 'perform_initial_validation': False, 'min_checkpoint_period': {'batches': 0}, 'min_validation_period': {'batches': 0}, 'checkpoint_policy': 'best', 'hyperparameters': {'dropout1': {'type': 'const', 'val': 0.25}, 'dropout2': {'type': 'const', 'val': 0.5}, 'global_batch_size': {'type': 'const', 'val': 64}, 'learning_rate': {'type': 'const', 'val': 1}, 'n_filters1': {'type': 'const', 'val': 32}, 'n_filters2': {'type': 'const', 'val': 64}}, 'searcher': {'max_length': {'batches': 937}, 'metric': 'validation_loss', 'name': 'single', 'smaller_is_better': True, 'source_checkpoint_uuid': None, 'source_trial_id': None}, 'resources': {'slots_per_trial': 1, 'weight': 1, 'native_parallel': False, 'agent_label': '', 'resource_pool': '', 'devices': None}, 'optimizations': {'aggregation_frequency': 1, 'average_aggregated_gradients': True, 'average_training_metrics': False, 'gradient_compression': False, 'mixed_precision': 'O0', 'tensor_fusion_threshold': 64, 'tensor_fusion_cycle_time': 5, 'auto_tune_tensor_fusion': False}, 'records_per_epoch': 0, 'scheduling_unit': 100, 'environment': {'image': {'cpu': 'determinedai/environments:py-3.7-pytorch-1.7-tf-1.15-cpu-da9ba40', 'gpu': 'determinedai/environments:cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40'}, 'environment_variables': {}, 'ports': None, 'force_pull_image': False, 'pod_spec': None, 'add_capabilities': None, 'drop_capabilities': None}, 'reproducibility': {'experiment_seed': 1620272038}, 'max_restarts': 5, 'debug': False, 'internal': None, 'entrypoint': 'model_def:MNistTrial', 'data_layer': {'container_storage_path': None, 'type': 'shared_fs'}, 'profiling': {'enabled': False, 'begin_on_batch': 0, 'end_after_batch': 0}}, 'hparams': {'dropout1': 0.25, 'dropout2': 0.5, 'global_batch_size': 64, 'learning_rate': 1, 'n_filters1': 32, 'n_filters2': 64}, 'initial_workload': <RUN_STEP (100 Batches): (1,1,1)>, 'latest_checkpoint': None, 'use_gpu': 1, 'container_gpus': ['GPU-a11059eb-0891-239b-9231-a055bf282a20'], 'slot_ids': [0], 'debug': False, 'workload_manager_type': 'TRIAL_WORKLOAD_MANAGER', 'det_rendezvous_ports': '1734,1750', 'det_trial_unique_port_offset': 0, 'det_trial_runner_network_interface': 'DET_AUTO_DETECT_NETWORK_INTERFACE', 'det_trial_id': '1', 'det_experiment_id': '1', 'det_cluster_id': '6d8bea20-a491-4451-953b-c3d093960076', 'trial_seed': 1001305543, 'managed_training': True, 'test_mode': False, 'on_cluster': True, '_per_slot_batch_size': 64, '_global_batch_size': 64}.
[2021-05-06T03:36:01Z] c5a9683c || INFO: Connecting to master at ws://192.168.245.182:8080/ws/trial/1/1/c5a9683c-809a-4a1d-b28b-50782b5536c1
[2021-05-06T03:36:01Z] c5a9683c || INFO: Connected to master
[2021-05-06T03:36:01Z] c5a9683c || INFO: Established WebSocket session with master
[2021-05-06T03:36:01Z] c5a9683c || INFO: Got rendezvous information: {'addrs': ['100.122.176.252:1734'], 'addrs2': ['100.122.176.252:1750'], 'containers': [{'addresses': [{'container_ip': '100.122.176.252', 'container_port': 1734, 'host_ip': '100.122.176.252', 'host_port': 1734}, {'container_ip': '100.122.176.252', 'container_port': 1750, 'host_ip': '100.122.176.252', 'host_port': 1750}]}], 'rank': 0, 'type': 'RENDEZVOUS_INFO'}
[2021-05-06T03:36:02Z] c5a9683c || INFO: Horovod config: {'use': False, 'aggregation_frequency': 1, 'fp16_compression': False, 'grad_updates_size_file': None, 'average_aggregated_gradients': True, 'average_training_metrics': False}.
[2021-05-06T03:36:02Z] c5a9683c || INFO: Loading Trial implementation with entrypoint model_def:MNistTrial.
[2021-05-06T03:36:03Z] c5a9683c || /opt/conda/lib/python3.7/site-packages/torch/cuda/__init__.py:104: UserWarning:
[2021-05-06T03:36:03Z] c5a9683c || GeForce RTX 3090 with CUDA capability sm_86 is not compatible with the current PyTorch installation.
[2021-05-06T03:36:03Z] c5a9683c || The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70 sm_75.
[2021-05-06T03:36:03Z] c5a9683c || If you want to use the GeForce RTX 3090 GPU with PyTorch, please check the instructions at https://pytorch.org/get-started/locally/
[2021-05-06T03:36:03Z] c5a9683c || warnings.warn(incompatible_device_warn.format(device_name, capability, " ".join(arch_list), device_name))
[2021-05-06T03:36:04Z] c5a9683c || INFO: Creating PyTorchTrialController with MNistTrial.
[2021-05-06T03:36:04Z] c5a9683c || INFO: Downloading https://s3-us-west-2.amazonaws.com/determined-ai-test-data/pytorch_mnist.tar.gz to /tmp/data-rank0/MNIST/pytorch_mnist.tar.gz
[2021-05-06T03:36:10Z] c5a9683c || INFO: Running workload <RUN_STEP (100 Batches): (1,1,1)>
[2021-05-06T03:36:10Z] c5a9683c || INFO: WebSocket closed
[2021-05-06T03:36:10Z] c5a9683c || INFO: Disconnected from master, exiting gracefully
[2021-05-06T03:36:10Z] c5a9683c || Traceback (most recent call last):
[2021-05-06T03:36:10Z] c5a9683c || File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
[2021-05-06T03:36:10Z] c5a9683c || "__main__", mod_spec)
[2021-05-06T03:36:10Z] c5a9683c || File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
[2021-05-06T03:36:10Z] c5a9683c || exec(code, run_globals)
[2021-05-06T03:36:10Z] c5a9683c || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 236, in <module>
[2021-05-06T03:36:10Z] c5a9683c || main()
[2021-05-06T03:36:10Z] c5a9683c || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 229, in main
[2021-05-06T03:36:10Z] c5a9683c || build_and_run_training_pipeline(env)
[2021-05-06T03:36:10Z] c5a9683c || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/exec/harness.py", line 149, in build_and_run_training_pipeline
[2021-05-06T03:36:10Z] c5a9683c || controller.run()
[2021-05-06T03:36:10Z] c5a9683c || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 152, in run
[2021-05-06T03:36:10Z] c5a9683c || w.total_batches_processed,
[2021-05-06T03:36:10Z] c5a9683c || File "/run/determined/pythonuserbase/lib/python3.7/site-packages/determined/pytorch/_pytorch_trial.py", line 310, in _train_for_step
[2021-05-06T03:36:10Z] c5a9683c || batch_idx=batch_idx,
[2021-05-06T03:36:10Z] c5a9683c || File "/run/determined/workdir/model_def.py", line 84, in train_batch
[2021-05-06T03:36:10Z] c5a9683c || output = self.model(data)
[2021-05-06T03:36:10Z] c5a9683c || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:36:10Z] c5a9683c || result = self.forward(*input, **kwargs)
[2021-05-06T03:36:10Z] c5a9683c || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/container.py", line 117, in forward
[2021-05-06T03:36:10Z] c5a9683c || input = module(input)
[2021-05-06T03:36:10Z] c5a9683c || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
[2021-05-06T03:36:10Z] c5a9683c || result = self.forward(*input, **kwargs)
[2021-05-06T03:36:10Z] c5a9683c || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 423, in forward
[2021-05-06T03:36:10Z] c5a9683c || File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
[2021-05-06T03:36:10Z] c5a9683c || return self._conv_forward(input, self.weight)
[2021-05-06T03:36:10Z] c5a9683c || self.padding, self.dilation, self.groups)
[2021-05-06T03:36:10Z] c5a9683c || RuntimeError: CUDA error: no kernel image is available for execution on the device
[2021-05-06T03:36:12Z] c5a9683c || INFO: container failed with non-zero exit code: (exit code 1)
Trial log stream ended. To reopen log stream, run: det trial logs -f 1
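
To summarize: the original attempt and all five restarts (max_restarts: 5) die on the same "RuntimeError: CUDA error: no kernel image is available for execution on the device", because the default GPU image (cuda-10.2-pytorch-1.7-tf-1.15-gpu-da9ba40) cannot target the RTX 3090's sm_86. A workaround sketch, assuming a CUDA 11 build of the task image is published for this Determined version: point environment.image.gpu in the experiment config at it. The image tag below is illustrative, not verified; check the determinedai/environments tags on Docker Hub for a real cuda-11.x build.

# Hypothetical helper, not from the logs: rewrite const.yaml to use a CUDA 11
# task image. The tag is an assumption; substitute a published cuda-11.x tag
# matching this Determined version.
import yaml  # PyYAML

with open("const.yaml") as f:
    cfg = yaml.safe_load(f)

cfg.setdefault("environment", {}).setdefault("image", {})[
    "gpu"
] = "determinedai/environments:cuda-11.0-pytorch-1.7-tf-1.15-gpu-da9ba40"

with open("const.yaml", "w") as f:
    yaml.safe_dump(cfg, f)

Then resubmit with "det experiment create const.yaml ." as in the quick start.
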