skyplane-project / skyplane
🔥 Blazing fast bulk data transfers between any cloud 🔥
Home Page: https://skyplane.org
License: Apache License 2.0
Hey Sam, is there a way we can deal with Azure resource groups in a better fashion?
Maybe just retain the resource groups and re-use them? Or even create one resource group and use that one only?
04:59:15 [DEBUG] Loaded gcp_project: skylark-shishir, azure_subscription: ab110d95-7b83-4cec-b9dc-400255f3166e
04:59:15 [WARN] Warning: soft file limit is set to 1024, increasing for process with `sudo prlimit --pid 7325 --nofile=1048576:1048576`
04:59:22 [DEBUG] Cloud SSH key initialization: 7.27s
04:59:22 [WARN] Instances will remain up and may result in continued cloud billing. Remember to call `skylark deprovision` to deprovision gateways.
04:59:25 [WARN] Warning: malformed Azure resource group skylark-azure-f658ffe734944cfabcb792d6a9614af5 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:27 [WARN] Warning: malformed Azure resource group skylark-azure-b96e2902b57a49c786a5893755fac160 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:29 [WARN] Warning: malformed Azure resource group skylark-azure-271cf3c639914dea9c4a55dad85c2a97 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:30 [WARN] Warning: malformed Azure resource group skylark-azure-0eaf46df242e42f58e32ae444bcc4e5c found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:32 [WARN] Warning: malformed Azure resource group skylark-azure-d56a16bda7bc4c6586caf492ee5759b3 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:34 [WARN] Warning: malformed Azure resource group skylark-azure-15d02bf2734d452da88bdde1f99a1354 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:35 [WARN] Warning: malformed Azure resource group skylark-azure-83243f6d1d224aaf92c087345d9d4211 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:37 [WARN] Warning: malformed Azure resource group skylark-azure-d577b6da64e54c8da11e1476ad316e82 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:39 [WARN] Warning: malformed Azure resource group skylark-azure-4b678bb8d0a44a059b6c8016c3d23a76 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:41 [WARN] Warning: malformed Azure resource group skylark-azure-0b1b2344bb364684bd644982a98c8681 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:42 [WARN] Warning: malformed Azure resource group skylark-azure-751b38d4048d4155a3437497f60e2c21 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:44 [WARN] Warning: malformed Azure resource group skylark-azure-a1d03fe6d8c1434096161391aa415030 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
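One way to approach the "create one resource group and use that one only" idea from the question above is a get-or-create helper. This is a minimal sketch, not the project's code: the group name and the helper name are hypothetical, and `client` is assumed to expose `resource_groups.check_existence` and `resource_groups.create_or_update` the way the `ResourceManagementClient` in azure-mgmt-resource does.

```python
from typing import Any

# Hypothetical name for the single shared group; chosen for illustration only.
SHARED_RESOURCE_GROUP = "skylark"


def ensure_resource_group(client: Any, location: str, name: str = SHARED_RESOURCE_GROUP) -> str:
    """Create the shared resource group only if it is missing, then return its name.

    `client` is assumed to behave like azure-mgmt-resource's
    ResourceManagementClient (an assumption, not verified against Skylark).
    """
    if not client.resource_groups.check_existence(name):
        client.resource_groups.create_or_update(name, {"location": location})
    return name
```

Every provisioning call would then pass the returned name instead of generating a new per-instance group, so there would be no per-run groups left to scan and warn about.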
Import issue when running skylark replicate-random. The current workaround is to use conda instead.
(env) ubuntu@ip-172-31-52-37:~/skylark$ skylark replicate-random aws:ap-northeast-1 aws:eu-central-1 --chunk-size-mb 8 --n-chunks 1024 --num-gateways 1 --num-outgoing-connections 64
Traceback (most recent call last):
File "/home/ubuntu/skylark/env/bin/skylark", line 6, in <module>
from pkg_resources import load_entry_point
File "/home/ubuntu/skylark/env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3252, in <module>
def _initialize_master_working_set():
File "/home/ubuntu/skylark/env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3235, in _call_aside
f(*args, **kwargs)
File "/home/ubuntu/skylark/env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3264, in _initialize_master_working_set
working_set = WorkingSet._build_master()
File "/home/ubuntu/skylark/env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 583, in _build_master
ws.require(__requires__)
File "/home/ubuntu/skylark/env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 900, in require
needed = self.resolve(parse_requirements(requirements))
File "/home/ubuntu/skylark/env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 786, in resolve
raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'grpcio-status<2.0dev,>=1.33.2; extra == "grpc"' distribution was not found and is required by google-api-core
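To confirm which distributions are actually missing from the virtualenv before switching to conda, a quick check with the standard library's `importlib.metadata` (Python 3.8+) can help. This is a small diagnostic sketch; the helper name is hypothetical, and the two distribution names are taken from the error message above.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # Distribution names taken from the DistributionNotFound message.
    for name, version in installed_versions(["grpcio-status", "google-api-core"]).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```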
Remaining todos:
One way to easily integrate configs would be to pass all of the config variables here from config.json. Feel free to open an issue linking this area of the code so we don't block this PR.
Originally posted by @parasj in #72 (comment)
When an IAM role is created for DataSync, its permissions take time to propagate. We need a way to retry line 84 in skylark/skylark/cli/cli_aws.py.
(base) ubuntu@sky1:~/skylark$ skylark aws cp-datasync imagenet-records-useast1 skylark-test-us-east-1 fake_imagenet
Creating datasync-role
IAM role ARN: arn:aws:iam::693554043898:role/datasync-role
Traceback (most recent call last):
File "/home/ubuntu/miniconda/bin/skylark", line 33, in <module>
sys.exit(load_entry_point('skylark', 'console_scripts', 'skylark')())
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 500, in wrapper
return callback(**use_params) # type: ignore
File "/home/ubuntu/skylark/skylark/cli/cli_aws.py", line 84, in cp_datasync
src_response = ds_client_src.create_location_s3(
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/botocore/client.py", line 391, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/botocore/client.py", line 719, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.errorfactory.InvalidRequestException: An error occurred (InvalidRequestException) when calling the CreateLocationS3 operation: Unable to assume role. Reason: Access denied when calling sts:AssumeRole; roleName=datasync-role, roleArn=arn:aws:iam::693554043898:role/datasync-role
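The retry asked for above could be a generic exponential-backoff wrapper around the failing call. The following is a sketch under assumptions: the helper name, attempt count, and delays are illustrative, not values from the Skylark code.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    max_attempts: int = 6,
    initial_delay: float = 1.0,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn() until it succeeds or max_attempts is reached, sleeping between tries."""
    delay = initial_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error unchanged
            sleep(delay)
            delay *= backoff
    raise ValueError("max_attempts must be at least 1")
```

In `cp_datasync`, the `create_location_s3` call could be passed in as a lambda, with `retry_on` narrowed to the `InvalidRequestException` seen in the traceback so that unrelated failures are not retried.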
When I ctrl-c, copies still occur in the background
With a large number of instances, the skylark deprovision command starts decommissioning but never returns; it gets stuck partway through (for more than 30 min).
Deprovisioning 30 instances
Deprovisioning (azure:westus): 53% | 16/30 [04:19<03:02, 13.06s/it]
Status: broken as of #93
There seems to be an error in how we check for the presence of object stores. This wasn't happening before.
Command:
./scripts/experiment.sh azure:westus azure:eastus
Trace:
SKYLARK_DOCKER_IMAGE=ghcr.io/parasj/skylark:local-ab2b075eb2dda73a2cbcc9e9434d42b5
03:26:23 [WARN] Gurobi not installed, using CoinOR instead.
03:26:30 [DEBUG] Solve throughput problem: 6.38s
03:26:30 [DEBUG] Total cost: $0.0202 (egress: $0.0200, instance: $0.0002)
03:26:30 [DEBUG] Total throughput: [25.0] Gbps
03:26:30 [DEBUG] Total runtime: 0.32s
03:26:30 [DEBUG] Instance regions: [azure:eastus=2, azure:westus=2]
03:26:30 [DEBUG] Flow matrix:
03:26:30 [DEBUG] azure:westus -> azure:eastus: 25.00 Gbps with 128.0 connections, 64.0GB (link capacity = 24.91 Gbps)
/home/ubuntu/skylark/skylark/replicate/solver.py:317: RuntimeWarning: invalid value encountered in double_scalars
average_egress_conns = [np.ceil(solution.var_conn[i, :].sum() / solution.var_instances_per_region[i]) for i in range(len(regions))]
/home/ubuntu/skylark/skylark/replicate/solver.py:318: RuntimeWarning: invalid value encountered in double_scalars
average_ingress_conns = [np.ceil(solution.var_conn[:, i].sum() / solution.var_instances_per_region[i]) for i in range(len(regions))]
03:26:30 [WARN] azure:westus:0:0c -> azure:eastus (partial): 64.0c of 64.0c remaining
03:26:30 [WARN] azure:westus:1:0c -> azure:eastus: 64.0c of 0c remaining
03:26:30 [WARN] azure:westus:0:0c -> azure:eastus:0:0c: 64c remaining
03:26:30 [WARN] azure:westus:1:0c -> azure:eastus:1:0c: 64.0c remaining
03:26:30 [WARN] Scaling connections by 1.00x
data/plan/azure-westus_azure-eastus_8_fake_imagenet.json
03:26:31 [DEBUG] Loaded gcp_project: skylark-shishir, azure_subscription: ab110d95-7b83-4cec-b9dc-400255f3166e
03:26:31 [WARN] Warning: soft file limit is set to 1024, increasing for process with `sudo prlimit --pid 23364 --nofile=1048576:1048576`
03:26:33 [INFO] Searching for orphaned Azure resources...
03:26:38 [INFO] Done cleaning up orphaned Azure resources
03:26:38 [DEBUG] Cloud SSH key initialization: 6.08s
03:26:38 [WARN] Instances will remain up and may result in continued cloud billing. Remember to call `skylark deprovision` to deprovision gateways.
03:26:42 [DEBUG] Provisioning instances and waiting to boot: 0.00s
03:26:42 [DEBUG] Waiting for ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-3e860ac0553e418b91625ae4fd53764e to be ready
03:26:42 [DEBUG] Waiting for ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-30c7ab543cc044758046f366a24a80f5 to be ready
03:26:42 [DEBUG] Waiting for ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9feb4f96d68444c99d02c2a3662008f6 to be ready
03:26:42 [DEBUG] Waiting for ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-4c72ff9e8c304dc993dca5666d03b998 to be ready
03:26:43 [DEBUG] ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9feb4f96d68444c99d02c2a3662008f6 is ready
03:26:43 [DEBUG] ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-4c72ff9e8c304dc993dca5666d03b998 is ready
03:26:43 [DEBUG] ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-3e860ac0553e418b91625ae4fd53764e is ready
03:26:43 [DEBUG] ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-30c7ab543cc044758046f366a24a80f5 is ready
03:26:45 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9feb4f96d68444c99d02c2a3662008f6, host: 20.106.252.73: Installing docker
03:26:45 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-4c72ff9e8c304dc993dca5666d03b998, host: 20.106.249.118: Installing docker
03:26:45 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-3e860ac0553e418b91625ae4fd53764e, host: 104.210.32.190: Installing docker
03:26:45 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-30c7ab543cc044758046f366a24a80f5, host: 104.210.39.87: Installing docker
03:26:46 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-4c72ff9e8c304dc993dca5666d03b998, host: 20.106.249.118: Starting monitoring
03:26:46 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9feb4f96d68444c99d02c2a3662008f6, host: 20.106.252.73: Starting monitoring
03:26:47 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-4c72ff9e8c304dc993dca5666d03b998, host: 20.106.249.118: Pulling docker image
03:26:47 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9feb4f96d68444c99d02c2a3662008f6, host: 20.106.252.73: Pulling docker image
03:26:47 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-30c7ab543cc044758046f366a24a80f5, host: 104.210.39.87: Starting monitoring
03:26:48 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-3e860ac0553e418b91625ae4fd53764e, host: 104.210.32.190: Starting monitoring
03:26:48 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-30c7ab543cc044758046f366a24a80f5, host: 104.210.39.87: Pulling docker image
03:26:49 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-3e860ac0553e418b91625ae4fd53764e, host: 104.210.32.190: Pulling docker image
03:26:49 [ERROR] Failed to read AWS credentials locally No section: 'default'
03:26:49 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-4c72ff9e8c304dc993dca5666d03b998, host: 20.106.249.118: Starting gateway container ghcr.io/parasj/skylark:local-ab2b075eb2dda73a2cbcc9e9434d42b5
03:26:49 [ERROR] Failed to read AWS credentials locally No section: 'default'
03:26:49 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9feb4f96d68444c99d02c2a3662008f6, host: 20.106.252.73: Starting gateway container ghcr.io/parasj/skylark:local-ab2b075eb2dda73a2cbcc9e9434d42b5
03:26:50 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-4c72ff9e8c304dc993dca5666d03b998, host: 20.106.249.118: Gateway started ed486c186ad74bfb94e2b7308f505b5178efb893444481de1c0d908d4b5c4cd5
03:26:50 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9feb4f96d68444c99d02c2a3662008f6, host: 20.106.252.73: Gateway started d4bfccca38054fb09af98769b0bfe5c4961328b49db1d229244fa0f319daba64
03:26:52 [ERROR] Failed to read AWS credentials locally No section: 'default'
03:26:52 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-30c7ab543cc044758046f366a24a80f5, host: 104.210.39.87: Starting gateway container ghcr.io/parasj/skylark:local-ab2b075eb2dda73a2cbcc9e9434d42b5
03:26:52 [ERROR] Failed to read AWS credentials locally No section: 'default'
03:26:52 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-3e860ac0553e418b91625ae4fd53764e, host: 104.210.32.190: Starting gateway container ghcr.io/parasj/skylark:local-ab2b075eb2dda73a2cbcc9e9434d42b5
03:26:53 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-30c7ab543cc044758046f366a24a80f5, host: 104.210.39.87: Gateway started 138dcf9be7f63f411f4080d16d89884df30d7c31718681c8f4aa4fa31557a32f
03:26:53 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-3e860ac0553e418b91625ae4fd53764e, host: 104.210.32.190: Gateway started 849f977b19c1473da01a0a4c4612f37570a99def0a423226014314a6e6a03b59
03:26:54 [DEBUG] Install gateway package on instances: 12.42s
03:26:54 [INFO] Provisioned ReplicationTopologyGateway(region='azure:westus', instance=0): http://104.210.32.190:8888/container/849f977b19c1
03:26:54 [INFO] Provisioned ReplicationTopologyGateway(region='azure:westus', instance=1): http://104.210.39.87:8888/container/138dcf9be7f6
03:26:54 [INFO] Provisioned ReplicationTopologyGateway(region='azure:eastus', instance=0): http://20.106.252.73:8888/container/d4bfccca3805
03:26:54 [INFO] Provisioned ReplicationTopologyGateway(region='azure:eastus', instance=1): http://20.106.249.118:8888/container/ed486c186ad7
Query object sizes (fake_imagenet/validation-00127-of-00128): 100% | 1152/1152 [00:05<00:00, 203.47it/s]
03:27:01 [DEBUG] Building chunk requests: 0.00s
03:27:01 [DEBUG] Sending 1152 chunk requests to 104.210.32.190
03:27:02 [DEBUG] Dispatch chunk requests: 1.12s
03:27:02 [WARN] ==>Container already exists. Deletion started. Try restarting after sufficient gap
03:27:02 [WARN] ==> Alternatively use a diff bucket name with `--bucket-prefix`
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/thread.py", line 40, in _python_exit
t.join()
File "/home/ubuntu/miniconda/lib/python3.8/threading.py", line 1011, in join
self._wait_for_tstate_lock()
File "/home/ubuntu/miniconda/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
elif lock.acquire(block, timeout):
Related to #146 I think?
23:59:47 [DEBUG] Provisioning instances and waiting to boot: 215.36s
Traceback (most recent call last):
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/polling/base_polling.py", line 483, in run
self._poll()
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/polling/base_polling.py", line 527, in _poll
_raise_if_bad_http_status_and_method(self._pipeline_response.http_response)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/polling/base_polling.py", line 112, in _raise_if_bad_http_status_and_method
raise BadStatus(
azure.core.polling.base_polling.BadStatus: Invalid return status 404 for 'GET' operation
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/miniconda/bin/skylark", line 33, in <module>
sys.exit(load_entry_point('skylark', 'console_scripts', 'skylark')())
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 500, in wrapper
return callback(**use_params) # type: ignore
File "/home/ubuntu/skylark/skylark/cli/cli.py", line 245, in replicate_json
rc.provision_gateways(reuse_gateways)
File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 148, in provision_gateways
results = do_parallel(
File "/home/ubuntu/skylark/skylark/utils/utils.py", line 72, in do_parallel
args, result = future.result()
File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ubuntu/skylark/skylark/utils/utils.py", line 65, in wrapped_fn
return args, func(args)
File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 139, in provision_gateway_instance
server = self.azure.provision_instance(subregion, self.azure_instance_class)
File "/home/ubuntu/skylark/skylark/compute/azure/azure_cloud_provider.py", line 409, in provision_instance
nic_result = poller.result()
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/polling/_poller.py", line 255, in result
self.wait(timeout)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/tracing/decorator.py", line 83, in wrapper_use_tracer
return func(*args, **kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/polling/_poller.py", line 275, in wait
raise self._exception # type: ignore
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/polling/_poller.py", line 192, in _start
self._polling_method.run()
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/polling/base_polling.py", line 487, in run
raise HttpResponseError(
azure.core.exceptions.HttpResponseError: (ResourceNotFound) The Resource 'Microsoft.Network/networkInterfaces/skylark-azure-09d7c8e030714e3faca55446a6170e1f-vm-nic' under resource group 'skylark' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix
Code: ResourceNotFound
Message: The Resource 'Microsoft.Network/networkInterfaces/skylark-azure-09d7c8e030714e3faca55446a6170e1f-vm-nic' under resource group 'skylark' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix
Command: ./scripts/experiment.sh aws:us-east-1 azure:westus
While object transfers are in progress, a resource such as the source or destination bucket may be deleted. In that scenario we need to (a) catch and handle the exception gracefully so that the replicator does not stall, and (b) show the user the stack trace. With the current architecture, the stack trace could be written to the logs on the gateways, but it is not printed on the user's terminal.
See the run below:
(base) ubuntu@ip-172-31-82-174:~/skylark$ source scripts/pack_docker.sh && python skylark/test/test_replicator_client.py --gateway-docker-image $SKYLARK_DOCKER_IMAGE --n-chunks 100 --chunk-size-mb 4 --num-gateways 1 --src-region azure:eastus --dest-region azure:westus --bucket-prefix test1234567890
Building docker image
[+] Building 2.6s (22/22) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 990B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 434B 0.0s
=> resolve image config for docker.io/docker/dockerfile:1 0.1s
=> CACHED docker-image://docker.io/docker/dockerfile:1@sha256:42399d4635eddd7a9b8a24be879d2f9a930d0ed040a61324cfdf59ef1357b3b2 0.0s
=> [internal] load .dockerignore 0.0s
=> [internal] load build definition from Dockerfile 0.0s
=> [internal] load metadata for docker.io/library/python:3.8-slim 0.1s
=> [stage-0 1/13] FROM docker.io/library/python:3.8-slim@sha256:95240f5291de3193c1299c5b2513f9bb99ecdae0fabd137156b0fb8c47afd6 0.0s
=> [internal] load build context 0.1s
=> => transferring context: 7.99MB 0.1s
=> CACHED [stage-0 2/13] RUN echo 'net.ipv4.ip_local_port_range = 12000 65535' >> /etc/sysctl.conf 0.0s
=> CACHED [stage-0 3/13] RUN echo 'fs.file-max = 1048576' >> /etc/sysctl.conf 0.0s
=> CACHED [stage-0 4/13] RUN mkdir -p /etc/security/ 0.0s
=> CACHED [stage-0 5/13] RUN echo '* soft nofile 1048576' >> /etc/security/limits.conf 0.0s
=> CACHED [stage-0 6/13] RUN echo '* hard nofile 1048576' >> /etc/security/limits.conf 0.0s
=> CACHED [stage-0 7/13] RUN echo 'root soft nofile 1048576' >> /etc/security/limits.conf 0.0s
=> CACHED [stage-0 8/13] RUN echo 'root hard nofile 1048576' >> /etc/security/limits.conf 0.0s
=> CACHED [stage-0 9/13] COPY scripts/requirements-gateway.txt /tmp/requirements-gateway.txt 0.0s
=> CACHED [stage-0 10/13] RUN --mount=type=cache,target=/root/.cache/pip pip install --no-cache-dir --compile -r /tmp/requireme 0.0s
=> CACHED [stage-0 11/13] WORKDIR /pkg 0.0s
=> [stage-0 12/13] COPY . . 0.1s
=> [stage-0 13/13] RUN pip install -e . 2.0s
=> exporting to image 0.1s
=> => exporting layers 0.1s
=> => writing image sha256:4c751e3f4bc3572597df2bd2a147e72aebbc3199c46eb7bffed15f23a45e0604 0.0s
=> => naming to docker.io/library/skylark 0.0s
Uploading docker image to ghcr.io/parasj/skylark:local-0ac2daa948e124df4ab7aa4d5c445816
The push refers to repository [ghcr.io/parasj/skylark]
fb0a04656026: Pushed
2426c1884d13: Pushed
d5089b43742f: Layer already exists
b485462ae282: Layer already exists
d65daa64516b: Layer already exists
58dfc2b4e272: Layer already exists
8d558f297057: Layer already exists
264394ca8d1f: Layer already exists
42a596f0a6b9: Layer already exists
e3528ada1b37: Layer already exists
fe022393ef8d: Layer already exists
e4ab298cd14a: Layer already exists
ec8e20bb6d54: Layer already exists
51f094ff7b94: Layer already exists
1a40cb2669f8: Layer already exists
32034715e5d4: Layer already exists
7d0ebbe3f5d2: Layer already exists
local-0ac2daa948e124df4ab7aa4d5c445816: digest: sha256:0778f6c5df5cb3fa9134013eaf1ead2b003394569a18f2ea188f6b3b71d4d1e8 size: 3864
Deleted build cache objects:
e5pq85w786mj42tdk56afje4l
cmquw1liqm37jk29mw9bh3jnq
wgvaunwqmeluwarhmr2ubh1n0
Total reclaimed space: 7.971MB
SKYLARK_DOCKER_IMAGE=ghcr.io/parasj/skylark:local-0ac2daa948e124df4ab7aa4d5c445816
=================================================
______ _ _ _
/ _____)| | | | | |
( (____ | | _ _ _ | | _____ ____ | | _
\____ \ | |_/ )| | | || | (____ | / ___)| |_/ )
_____) )| _ ( | |_| || | / ___ || | | _ (
(______/ |_| \_) \__ | \_)\_____||_| |_| \_)
(____/
=================================================
09:03:56 [INFO] Not skipping upload, source bucket is test1234567890-skylark-eastus, destination bucket is test1234567890-skylark-westus
09:03:56 [INFO] Creating test objects
09:03:57 [INFO] Uploading 100 to bucket test1234567890-skylark-eastus
09:03:57 [INFO] Creating replication client
09:03:57 [DEBUG] Loaded gcp_project: skylark-333700, azure_subscription: ab110d95-7b83-4cec-b9dc-400255f3166e
09:04:05 [DEBUG] Cloud SSH key initialization: 7.58s
09:04:05 [INFO] Provisioning gateway instances
09:04:08 [DEBUG] Provisioning instances and waiting to boot: 0.00s
09:04:09 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-4e90f72dd70146ab931f76fbbf7d2121, host: 23.101.204.63: Installing docker
09:04:09 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9cc892cf8ddf47d5bf8f495602674bfc, host: 20.115.111.168: Installing docker
09:04:11 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9cc892cf8ddf47d5bf8f495602674bfc, host: 20.115.111.168: Starting monitoring
09:04:11 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-4e90f72dd70146ab931f76fbbf7d2121, host: 23.101.204.63: Starting monitoring
09:04:13 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9cc892cf8ddf47d5bf8f495602674bfc, host: 20.115.111.168: Pulling docker image
09:04:13 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-4e90f72dd70146ab931f76fbbf7d2121, host: 23.101.204.63: Pulling docker image
09:04:15 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9cc892cf8ddf47d5bf8f495602674bfc, host: 20.115.111.168: Starting gateway container ghcr.io/parasj/skylark:local-0ac2daa948e124df4ab7aa4d5c445816
09:04:16 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9cc892cf8ddf47d5bf8f495602674bfc, host: 20.115.111.168: Gateway started e1dfbaaa45ca7e22785516b162cefcb9c736419d053702c75b75b201cacf0f8d
09:04:20 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-4e90f72dd70146ab931f76fbbf7d2121, host: 23.101.204.63: Starting gateway container ghcr.io/parasj/skylark:local-0ac2daa948e124df4ab7aa4d5c445816
09:04:24 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-4e90f72dd70146ab931f76fbbf7d2121, host: 23.101.204.63: Gateway started 103a36fada2088adf5e66d505f0d09cf41cd31cbbfb2dd5a33e2f67beb5e967b
09:04:24 [DEBUG] Install gateway package on instances: 16.25s
09:04:24 [INFO] Provisioned ReplicationTopologyGateway(region='azure:westus', instance=0): http://23.101.204.63:8888/container/103a36fada20
09:04:24 [INFO] Provisioned ReplicationTopologyGateway(region='azure:eastus', instance=0): http://20.115.111.168:8888/container/e1dfbaaa45ca
Query object sizes (/test/direct_replication/99): 100% | 100/100 [00:00<00:00, 427.80it/s]
09:04:24 [DEBUG] Building chunk requests: 0.00s
09:04:24 [DEBUG] Sending 100 chunk requests to 20.115.111.168
09:04:25 [DEBUG] Dispatch chunk requests: 0.32s
09:04:25 [INFO] 0.39GByte replication job launched
Replication: average 0.25Gbit/s: 100% | 3.12G/3.12G [00:13<00:00, 251Mbit/s]
09:04:38 [INFO] Replication completed in 12.68s (0.25Gbit/s)
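For the bucket-deletion scenario in this issue, one possible shape for points (a) and (b) is a worker wrapper that catches the exception, formats the stack trace, and hands it to whoever reports to the user. The sketch below is illustrative only: the in-process queue stands in for whatever channel the gateways would use to return errors to the client, and all names are hypothetical.

```python
import queue
import threading
import traceback
from typing import Callable, List


def run_worker(task: Callable[[], None], errors: "queue.Queue[str]") -> None:
    """Run task; on failure, record the formatted stack trace instead of dying silently."""
    try:
        task()
    except Exception:
        errors.put(traceback.format_exc())


def drain_errors(errors: "queue.Queue[str]") -> List[str]:
    """Collect every stack trace reported so far without blocking."""
    traces: List[str] = []
    while True:
        try:
            traces.append(errors.get_nowait())
        except queue.Empty:
            return traces


def missing_bucket_task() -> None:
    # Stand-in for a chunk transfer whose bucket was deleted mid-run.
    raise FileNotFoundError("bucket no longer exists")


if __name__ == "__main__":
    errors: "queue.Queue[str]" = queue.Queue()
    worker = threading.Thread(target=run_worker, args=(missing_bucket_task, errors))
    worker.start()
    worker.join()
    for trace in drain_errors(errors):
        print(trace)  # what the user would see on their terminal
```

Because the worker returns normally after recording the error, the caller can decide whether to abort the whole job or skip the affected chunks rather than waiting forever.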
When the assertion fails, we need a better mechanism to deal with it. Right now, the run just continues.
Traceback (most recent call last):
File "scripts/setup_bucket.py", line 94, in <module>
main(parse_args())
File "scripts/setup_bucket.py", line 50, in main
obj_store_interface_src = ObjectStoreInterface.create(args.src_region, src_bucket)
File "/home/ubuntu/skylark/skylark/obj_store/object_store_interface.py", line 53, in create
return AzureInterface(region_tag.split(":")[1], bucket)
File "/home/ubuntu/skylark/skylark/obj_store/azure_interface.py", line 23, in __init__
assert self.azure_region in azure_storage_credentials
AssertionError
Building docker image
[+] Building 3.1s (19/19) FINISHED
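One option is to replace the bare `assert` with an explicit exception and a non-zero exit status, so that the calling script can stop instead of moving on to the Docker build. This is a sketch, assuming the check is a simple membership test as in the traceback; the exception class, function names, and the credential table are hypothetical.

```python
import sys
from typing import Mapping


class MissingCredentialsError(RuntimeError):
    """Raised when no storage credentials are configured for a region."""


def require_region_credentials(region: str, credentials: Mapping[str, object]) -> object:
    """Return the credentials for region, or fail with an actionable message."""
    if region not in credentials:
        known = ", ".join(sorted(credentials)) or "none"
        raise MissingCredentialsError(
            f"No Azure storage credentials for region '{region}' (configured regions: {known})"
        )
    return credentials[region]


def main() -> int:
    # Hypothetical stand-in for the credential table checked by the assertion.
    credentials = {"eastus": "example-key"}
    try:
        require_region_credentials("westus", credentials)
    except MissingCredentialsError as err:
        print(f"error: {err}", file=sys.stderr)
        return 1  # non-zero exit status lets a calling script stop
    return 0
```

An explicit exception also survives `python -O`, which strips `assert` statements. The calling shell script still has to honor the exit status, for example by running under `set -e`.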
OS: Mac OS 10.15.7
Command: skylark replicate-random aws:us-east-1 aws:us-west-1
Error trace:
sudo: prlimit: command not found
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.9/bin/skylark", line 33, in <module>
sys.exit(load_entry_point('skylark', 'console_scripts', 'skylark')())
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/typer/main.py", line 500, in wrapper
return callback(**use_params) # type: ignore
File "/Users/asimbiswal/Desktop/Cal/RISELab/skylark/skylark/skylark/cli/cli.py", line 134, in replicate_random
check_ulimit()
File "/Users/asimbiswal/Desktop/Cal/RISELab/skylark/skylark/skylark/cli/cli_helper.py", line 241, in check_ulimit
subprocess.check_output(increase_soft_limit)
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 424, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sudo', 'prlimit', '--pid', '1581', '--nofile=1048576:1048576']' returned non-zero exit status 1.
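The trace above fails because `prlimit` is a Linux utility that macOS does not ship. One possible alternative, sketched here as an assumption rather than the project's actual fix, is to raise the soft file limit from inside the process with Python's standard `resource` module. The helper name `raise_nofile_soft_limit` is hypothetical.

```python
import resource


def raise_nofile_soft_limit(target: int = 1048576) -> int:
    """Try to raise the soft open-file limit without sudo; return the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process may only raise the soft limit up to the hard limit.
    desired = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if soft != resource.RLIM_INFINITY and desired > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (desired, hard))
        except (ValueError, OSError):
            # Some platforms reject large values even below the hard limit;
            # in that case the current limit is kept instead of crashing.
            pass
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Calling `raise_nofile_soft_limit()` at startup would replace the `subprocess.check_output` call, and a caller could log a warning if the returned value is still below what the transfer needs.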
Implement copy_local_azure, copy_azure_local, copy_local_gcs, and copy_gcs_local in skylark/cli/cli_helper.py. Refer to the comment in #99.
See the following output from a gateway log when running the basic test in the readme:
today at 8:06:39 PM Process Process-55:
today at 8:06:39 PM Traceback (most recent call last):
today at 8:06:39 PM File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
today at 8:06:39 PM self.run()
today at 8:06:39 PM File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
today at 8:06:39 PM self._target(*self._args, **self._kwargs)
today at 8:06:39 PM File "/pkg/skylark/gateway/gateway_sender.py", line 60, in worker_loop
today at 8:06:39 PM self.send_chunks([next_chunk_id], dest_ip)
today at 8:06:39 PM File "/pkg/skylark/gateway/gateway_sender.py", line 128, in send_chunks
today at 8:06:39 PM sock.sendall(chunk_data)
today at 8:06:39 PM ConnectionResetError: [Errno 104] Connection reset by peer
Based on my reading of the code, it looks like the connection isn't persistent --- the destination port is left listening, but a separate connection is still used for each batch of chunks.
I'm approving since it improves bandwidth, but I still think that keeping the TCP connection itself alive for a long time is good, to avoid ramping up congestion control each time.
Originally posted by @samkumar in #38 (review)
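To illustrate the point about keeping one TCP connection alive across batches, here is a minimal sketch. `PersistentSender`, `serve_once`, and the 8-byte length-prefix framing are assumptions made for illustration only and are not taken from gateway_sender.py.

```python
import socket
import struct
import threading


def recv_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes or raise ConnectionError if the peer closes first."""
    buf = bytearray()
    while len(buf) < n:
        part = conn.recv(n - len(buf))
        if not part:
            raise ConnectionError("peer closed connection")
        buf.extend(part)
    return bytes(buf)


class PersistentSender:
    """Opens the socket lazily and reuses it for every batch of chunks."""

    def __init__(self, host: str, port: int):
        self.addr = (host, port)
        self.sock = None
        self.connects = 0  # counts how many TCP connections were opened

    def send_batch(self, chunks):
        if self.sock is None:
            self.sock = socket.create_connection(self.addr)
            self.connects += 1
        for chunk in chunks:
            # Length-prefixed framing lets the receiver split the byte stream.
            self.sock.sendall(struct.pack("!Q", len(chunk)) + chunk)

    def close(self):
        if self.sock is not None:
            self.sock.close()
            self.sock = None


def serve_once(server_sock: socket.socket, received: list) -> None:
    """Accept a single connection and collect chunks until the sender closes it."""
    conn, _ = server_sock.accept()
    with conn:
        while True:
            try:
                header = recv_exact(conn, 8)
            except ConnectionError:
                return
            (size,) = struct.unpack("!Q", header)
            received.append(recv_exact(conn, size))
```

With this shape, congestion control ramps up once per destination rather than once per batch. The cost is that the sender has to detect and reopen a dead connection, which the sketch does not handle.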
~/.skylark/config.json
~/.ssh/skylark
instance-initiated-shutdown-behavior terminate
).
skylark init should support disabling specific cloud providers if users do not have access
skylark VPC instead of using the default VPC
Triggered by running skylark replicate-random on main branch:
09:19:10 [DEBUG] Loaded gcp_project: None, azure_subscription: None
09:19:10 [WARN] Warning: soft file limit is set to 1024, increasing for process with `sudo prlimit --pid 34921 --nofile=1048576:1048576`
09:19:17 [DEBUG] Cloud SSH key initialization: 6.51s
09:19:17 [WARN] Instances will remain up and may result in continued cloud billing. Remember to call `skylark deprovision` to deprovision gateways.
09:19:21 [WARN] Retrying start_instance due to An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (m5.8xlarge) is not supported in your requested Availability Zone (us-east-1e). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f. (attempt 1/4)
09:19:23 [WARN] Retrying start_instance due to An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (m5.8xlarge) is not supported in your requested Availability Zone (us-east-1e). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f. (attempt 2/4)
09:19:27 [WARN] Retrying start_instance due to An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (m5.8xlarge) is not supported in your requested Availability Zone (us-east-1e). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f. (attempt 3/4)
We can gain a lot of insights about the kind of data users are moving by logging their workloads to a central analytics store. We should log things like:
Status:
Querying for objects should start in parallel after the src bucket is ready. We don't have to wait till all the destination gateways are ready.
Command: ./scripts/experiment.sh aws:us-east-1 gcp:us-east1-b
00:12:31 [INFO] Provisioned ReplicationTopologyGateway(region='aws:us-east-1', instance=5): http://3.81.207.255:8888/container/34bae3e826c6
Query object sizes (fake_imagenet/validation-00127-of-00128): 100%|████████████████████| 1002/1002 [00:14<00:00, 67.88it/s]
00:12:46 [DEBUG] Building chunk requests: 0.00s
00:12:46 [DEBUG] Sending 154 chunk requests to 35.168.114.149
00:12:46 [DEBUG] Sending 154 chunk requests to 3.84.207.52
00:12:46 [DEBUG] Sending 154 chunk requests to 54.210.76.125
00:12:46 [DEBUG] Sending 154 chunk requests to 54.162.89.171
00:12:46 [DEBUG] Sending 386 chunk requests to 174.129.108.137
00:12:46 [DEBUG] Dispatch chunk requests: 0.17s
00:12:46 [INFO] 60.55GByte replication job launched
Replication: average 0.00Gbit/s: 0%| | 0.00/484G [00:20<?, ?bit/s]
00:13:06 [ERROR] No chunks completed after 20s! There is probably a bug, check logs. Exiting...
Replication: average 0.00Gbit/s: 0%| | 0.00/484G [00:20<?, ?bit/s]
{"total_runtime_s": 20.221991, "throughput_gbits": 0.0, "monitor_status": "timed_out", "success": false}
aws-us-east-1_gcp-us-east1-b_8_fake_imagenet
We should either delete the bucket after each experiment, or find a way to reuse the bucket. This might be happening since buckets need to be uniquely named.
(base) ubuntu@sky1:~/skylark/scripts$ ./experiment.sh aws:us-east-1 aws:us-west-1
experiments-skylark-us-east-1
experiments-skylark-us-west-1
data/plan/aws-us-east-1_aws-us-west-1_8_fake_imagenet.json
experiments-skylark-us-east-1
experiments-skylark-us-west-1
Traceback (most recent call last):
File "setup_bucket.py", line 96, in <module>
main(parse_args())
File "setup_bucket.py", line 53, in main
obj_store_interface_src.create_bucket()
File "/home/ubuntu/skylark/skylark/obj_store/s3_interface.py", line 76, in create_bucket
s3_client.create_bucket(Bucket=self.bucket_name)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/botocore/client.py", line 391, in _api_call
return self._make_api_call(operation_name, kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/botocore/client.py", line 719, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.errorfactory.BucketAlreadyExists: An error occurred (BucketAlreadyExists) when calling the CreateBucket operation: The requested bucket name is not
available. The bucket namespace is shared by all users of the system. Please select a different name and try again.
./experiment.sh: line 41: scripts/pack_docker.sh: No such file or directory
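Since S3 bucket names share one global namespace, one way to sidestep the BucketAlreadyExists collision shown above is to append a random suffix to each experiment bucket. This is only a sketch: `unique_bucket_name` is a hypothetical helper, and it covers naming only, not creating or deleting the bucket.

```python
import uuid


def unique_bucket_name(prefix: str, region: str, max_len: int = 63) -> str:
    """Build a bucket name that is very unlikely to collide with an existing one."""
    suffix = uuid.uuid4().hex[:12]  # 12 random hex characters
    base = f"{prefix}-{region}".lower()
    # S3 bucket names may be at most 63 characters, so trim the base if needed.
    base = base[: max_len - len(suffix) - 1].rstrip("-")
    return f"{base}-{suffix}"
```

Random names only help if the experiment script also deletes the bucket afterwards. The alternative raised in the issue, reusing a bucket the account already owns, would avoid accumulating buckets at all.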
We manually parse region tag strings which is confusing and sometimes inconsistent. Ideally, we should use something like a dataclass.
See: #169 (comment)
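A rough sketch of what such a dataclass could look like follows. The class name `RegionTag`, its fields, and the set of accepted providers are assumptions for illustration. The `provider:region` format is inferred from tags such as aws:us-east-1 and gcp:us-east1-b used elsewhere on this page.

```python
from dataclasses import dataclass

KNOWN_PROVIDERS = {"aws", "gcp", "azure"}  # assumed set of providers


@dataclass(frozen=True)
class RegionTag:
    provider: str
    region: str

    @classmethod
    def parse(cls, tag: str) -> "RegionTag":
        """Parse a 'provider:region' string, rejecting malformed input early."""
        provider, sep, region = tag.partition(":")
        if not sep or not provider or not region:
            raise ValueError(f"expected 'provider:region', got {tag!r}")
        if provider not in KNOWN_PROVIDERS:
            raise ValueError(f"unknown provider {provider!r} in {tag!r}")
        return cls(provider, region)

    def __str__(self) -> str:
        return f"{self.provider}:{self.region}"
```

Parsing once at the CLI boundary and passing `RegionTag` objects around would replace the scattered string splitting, and `frozen=True` makes the tags usable as dictionary keys.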
Will likely be faster and more reliable for platforms like Azure: https://www.hashicorp.com/blog/cdk-for-terraform-enabling-python-and-typescript-support
Traceback (most recent call last):
File "/home/ubuntu/miniconda/bin/skylark", line 33, in <module>
sys.exit(load_entry_point('skylark', 'console_scripts', 'skylark')())
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 500, in wrapper
return callback(**use_params) # type: ignore
File "/home/ubuntu/skylark/skylark/cli/cli.py", line 282, in deprovision
deprovision_skylark_instances(azure_subscription=azure_subscription, gcp_project_id=gcp_project)
File "/home/ubuntu/skylark/skylark/cli/cli_helper.py", line 267, in deprovision_skylark_instances
instances += gcp.get_matching_instances()
File "/home/ubuntu/skylark/skylark/compute/gcp/gcp_cloud_provider.py", line 164, in get_matching_instances
instances: List[Server] = super().get_matching_instances(**kwargs)
File "/home/ubuntu/skylark/skylark/compute/cloud_providers.py", line 60, in get_matching_instances
if not all(instance.tags().get(k, "") == v for k, v in tags.items()):
File "/home/ubuntu/skylark/skylark/compute/cloud_providers.py", line 60, in <genexpr>
if not all(instance.tags().get(k, "") == v for k, v in tags.items()):
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/cachetools/__init__.py", line 520, in wrapper
v = func(*args, **kwargs)
File "/home/ubuntu/skylark/skylark/compute/gcp/gcp_server.py", line 84, in tags
return self.get_instance_property("labels")
File "/home/ubuntu/skylark/skylark/compute/gcp/gcp_server.py", line 60, in get_instance_property
return self.get_gcp_instance()[prop]
KeyError: 'labels'
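The KeyError above is consistent with an instance description that has no `labels` field at all, which is an assumption about the cause rather than something the trace confirms. A defensive lookup could be sketched as follows, where `instance_labels` and `matches_tags` are hypothetical helper names operating on a plain dict.

```python
def instance_labels(instance: dict) -> dict:
    """Return the instance's labels, treating a missing or null field as empty."""
    return instance.get("labels") or {}


def matches_tags(instance: dict, tags: dict) -> bool:
    """True when every requested tag is present with the requested value."""
    labels = instance_labels(instance)
    return all(labels.get(k, "") == v for k, v in tags.items())
```

With this, unlabeled instances in the project are simply filtered out during deprovisioning instead of aborting the whole command.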
With the latest commit, even for 10 chunks of 1 MB each on a us-east<->us-west transfer, the first 9 chunks are relayed in ~2 seconds. In one case, the relay of the last chunk had not completed even after 42 minutes.
To recreate:
python skylark/test/test_replicator_client.py --gateway-docker-image $SKYLARK_DOCKER_IMAGE --skip-upload --n-chunks 10 --chunk-size-mb 1 --num-gateways 1 --src-region aws:us-east-1 --dest-region aws:us-west-1 --gcp-instance-class None
On the remote gateway, the configs are not being returned.
For example, this line here throws up a key not-found error.
As of now, the keys are stored in multiple locations. Most are in data/config.json, but some are elsewhere (e.g., the GCP key and the Azure blob keys). Further, each sub-module tries to authorize each time. We need to: a) consolidate all the keys; b) have a uniform way of reading keys (input param vs. environment variable vs. ?); c) get rid of code duplication and have one module that deals with authentication.
This warrants some discussion, given we have seen some techniques of authentication to be buggy previously.
If you are unable to access the blob (you can create a container but cannot add blobs to it), add the following two permissions, Storage Blob Data Contributor and Storage Queue Data Contributor, to the skylark Azure app through the Access Control (IAM) page on the container's webpage.
Exception:
Azure Blob Storage v12.9.0
Creating container:43b77945-de57-4980-a082-fb21979e81eb
Uploading to Azure Storage as blob:
test.txt
Exception:
This request is not authorized to perform this operation using this permission.
RequestId:509f5a35-d01e-001c-6304-2929af000000
Time:2022-02-23T22:30:39.3449575Z
ErrorCode:AuthorizationPermissionMismatch
$ skylark
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 584, in _build_master
ws.require(__requires__)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 901, in require
needed = self.resolve(parse_requirements(requirements))
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 792, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (Click 7.0 (/usr/lib/python3/dist-packages), Requirement.parse('click>=7.1.2'), {'flask'})
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ubuntu/.local/bin/skylark", line 6, in <module>
from pkg_resources import load_entry_point
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 3254, in <module>
def _initialize_master_working_set():
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 3237, in _call_aside
f(*args, **kwargs)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 3266, in _initialize_master_working_set
working_set = WorkingSet._build_master()
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 586, in _build_master
return cls._build_from_requirements(__requires__)
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 599, in _build_from_requirements
dists = ws.resolve(reqs, Environment())
File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 787, in resolve
raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'click>=7.1.2' distribution was not found and is required by flask
Trace
Traceback (most recent call last):
File "skylark/test/test_replicator_client.py", line 130, in <module>
main(parse_args())
File "skylark/test/test_replicator_client.py", line 89, in main
rc = ReplicatorClient(
File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 60, in __init__
do_parallel(lambda fn: fn(), jobs)
File "/home/ubuntu/skylark/skylark/utils/utils.py", line 67, in do_parallel
args, result = future.result()
File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ubuntu/skylark/skylark/utils/utils.py", line 60, in wrapped_fn
return args, func(args)
File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 60, in <lambda>
do_parallel(lambda fn: fn(), jobs)
File "/home/ubuntu/skylark/skylark/compute/gcp/gcp_cloud_provider.py", line 167, in configure_default_firewall
op = compute.firewalls().insert(project=self.gcp_project, body=fw_body).execute()
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/googleapiclient/_helpers.py", line 131, in positional_wrapper
return wrapped(*args, **kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/googleapiclient/http.py", line 937, in execute
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://compute.googleapis.com/compute/v1/projects/skylark-shishir/global/firewalls?alt=json returned "The resource 'projects/skylark-shishir/global/networks/default' was not found". Details: "[{'message': "The resource 'projects/skylark-shishir/global/networks/default' was not found", 'domain': 'global', 'reason': 'notFound'}]">
To recreate:
python skylark/test/test_replicator_client.py --gateway-docker-image $SKYLARK_DOCKER_IMAGE --skip-upload --n-chunks 512 --chunk-size-mb 10 --num-gateways 1 --src-region aws:us-east-1 --dest-region aws:us-west-1 --gcp-instance-class n2 --gcp-project skylark-shishir
test_replicator_client.py has now been modified to read the Azure subscription from an OS environment variable, similar to the cli. A subscription passed on the command line with --azure-subscription when invoking test_replicator overrides the environment variable. So, to fall back to the environment variable, we have to omit the flag, and args.azure-subscription then defaults to None. However, this raises the following issue.
2022-01-28 02:54:38.316 | INFO | __main__:main:178 - Provisioning gateway instances
Traceback (most recent call last):
File "skylark/test/test_replicator_client.py", line 206, in <module>
main(parse_args())
File "skylark/test/test_replicator_client.py", line 179, in main
rc.provision_gateways(
File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 76, in provision_gateways
assert len(azure_regions_to_provision) ==0, "Azure not enabled"
AssertionError: Azure not enabled
Command to replicate:
python skylark/test/test_replicator_client.py --gateway-docker-image $SKYLARK_DOCKER_IMAGE --n-chunks 100 --chunk-size-mb 4 --num-gateways 1 --src-region azure:eastus --dest-region azure:westus --gcp-project skylark-shishir --bucket-prefix t11
Workaround for now: just include --azure-subscription
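The intended precedence (command-line value first, then the environment variable) could be sketched as below. This is an illustration only: the variable name `AZURE_SUBSCRIPTION_ID` and the helper `resolve_azure_subscription` are assumptions, not names confirmed by the issue.

```python
import os
from typing import Mapping, Optional


def resolve_azure_subscription(
    cli_value: Optional[str],
    env: Mapping[str, str] = os.environ,
    var_name: str = "AZURE_SUBSCRIPTION_ID",  # hypothetical variable name
) -> Optional[str]:
    """Prefer the command-line value; otherwise fall back to the environment."""
    if cli_value:
        return cli_value
    # Treat an unset or empty variable as "Azure not configured".
    return env.get(var_name) or None
```

Resolving the value once before calling provision_gateways would mean the "Azure not enabled" assertion fires only when neither source provides a subscription.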
At the moment, errors are consumed silently and users have to SSH into an instance or use the log viewer to query the state of the gateway.
Upon exiting, logs from the gateway should be copied back to the replicator client to /tmp alongside the profile and saved logs from the replicator client. We should save all state, including the command the user ran. This will make debugging much easier since users can send us these logs via a GitHub issue.
Linked issue: #163
Command: skylark replicate-random aws:us-east-1 azure:eastus -n 10 -s $((1024 * 68)) --chunk-size-mb 68
Docker: SKYLARK_DOCKER_IMAGE=ghcr.io/parasj/skylark:local-a3059934af58de4641a93cc87cab0219
02:56:47 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-0e5e1e80daff456f828285adc72f0e6b, host: 52.142.46.50: Gateway started b6778e3a8693c9520d354c1e1da621621a38da646744385df8466c5f91ed3ba7
02:56:48 [DEBUG] Starting gateway aws:us-east-1:i-0e9e40d93ee201d79, host: 52.55.133.29: Gateway started 0c1c5df09e1d913b78a760e7e754b10df63e5c1fe3bfa9487be435e35bd5b735
02:58:54 [DEBUG] Install gateway package on instances: 138.89s
Traceback (most recent call last):
File "/home/ubuntu/miniconda/bin/skylark", line 33, in <module>
sys.exit(load_entry_point('skylark', 'console_scripts', 'skylark')())
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 500, in wrapper
return callback(**use_params) # type: ignore
File "/home/ubuntu/skylark/skylark/cli/cli.py", line 162, in replicate_random
rc.provision_gateways(reuse_gateways)
File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 183, in provision_gateways
do_parallel(lambda arg: arg[0].start_gateway(arg[1], gateway_docker_image=self.gateway_docker_image), args, n=-1)
File "/home/ubuntu/skylark/skylark/utils/utils.py", line 68, in do_parallel
args, result = future.result()
File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/_base.py", line 437, in result
return self.__get_result()
File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
raise self._exception
File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/thread.py", line 57, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/ubuntu/skylark/skylark/utils/utils.py", line 61, in wrapped_fn
return args, func(args)
File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 183, in <lambda>
do_parallel(lambda arg: arg[0].start_gateway(arg[1], gateway_docker_image=self.gateway_docker_image), args, n=-1)
File "/home/ubuntu/skylark/skylark/compute/server.py", line 203, in start_gateway
check_stderr(self.run_command(make_sysctl_tcp_tuning_command(cc="bbr" if use_bbr else "cubic")))
File "/home/ubuntu/skylark/skylark/compute/server.py", line 174, in run_command
client = self.ssh_client
File "/home/ubuntu/skylark/skylark/compute/server.py", line 101, in ssh_client
self.client = self.get_ssh_client_impl()
File "/home/ubuntu/skylark/skylark/compute/azure/azure_server.py", line 151, in get_ssh_client_impl
ssh_client.connect(
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/paramiko/client.py", line 349, in connect
retry_on_signal(lambda: sock.connect(addr))
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/paramiko/util.py", line 283, in retry_on_signal
return function()
File "/home/ubuntu/miniconda/lib/python3.8/site-packages/paramiko/client.py", line 349, in <lambda>
retry_on_signal(lambda: sock.connect(addr))
TimeoutError: [Errno 110] Connection timed out
This will enable supporting more buckets than just the committed ones. I added some preliminary code for this but need an account url.
At the moment, all storage is reported in gigabytes (10^9 bytes), not gibibytes (2^30 bytes). AWS, GCP, and Azure all use GB to mean GiB (1024 * 1024 * 1024 bytes).
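The difference between the two units is about 7%, which is easy to see in a small sketch. The helper names below are hypothetical and not part of the Skylark codebase.

```python
GB = 10**9   # gigabyte: decimal unit
GIB = 2**30  # gibibyte: binary unit, 1024 * 1024 * 1024 bytes


def bytes_to_gb(n_bytes: int) -> float:
    return n_bytes / GB


def bytes_to_gib(n_bytes: int) -> float:
    return n_bytes / GIB


def format_size(n_bytes: int) -> str:
    """Report both units so the reader cannot confuse them."""
    return f"{bytes_to_gib(n_bytes):.2f} GiB ({bytes_to_gb(n_bytes):.2f} GB)"


print(format_size(2**30))  # prints "1.00 GiB (1.07 GB)"
```

Labeling reported sizes explicitly as GiB, or printing both as above, would keep the tool's numbers comparable with the providers' own figures.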
replicate-json broken by some plans
Query object sizes (fake_imagenet/validation-00127-of-00128): 100%|████████████████████| 1152/1152 [00:14<00:00, 81.29it/s]
Traceback (most recent call last):
File "/home/ubuntu/miniconda3/bin/skylark", line 33, in <module>
sys.exit(load_entry_point('skylark', 'console_scripts', 'skylark')())
File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
return get_command(self)(*args, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/typer/main.py", line 500, in wrapper
return callback(**use_params) # type: ignore
File "/home/ubuntu/skylark/skylark/cli/cli.py", line 277, in replicate_json
job = rc.run_replication_plan(job)
File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 242, in run_replication_plan
assert len(chunk_batches) == len(src_instances), f"{len(chunk_batches)} batches, expected {len(src_instances)}"
AssertionError: 2 batches, expected 3
sarah-skylark-europe-north1-a fake_imagenet/train-00000-of-01024
Command:
# create plan
skylark solver solve-throughput gcp:europe-north1-a gcp:us-west4-a 16 -o plan.json --max-instances 16;
# run replication (obj store)
skylark replicate-json ${filename} \
--gcp-project skylark-sarah \
--source-bucket $src_bucket \
--dest-bucket $dest_bucket \
--key-prefix fake_imagenet > data/results/${experiment}/obj-store-logs.txt
Need a way to set-up Skylark docker on the spun-up nodes.
(base) ubuntu@sky1:~/skylark$ bash scripts/az_cleanup.sh
If you want to filter this list to a subset, pass an argument as a prefix: az_cleanup.sh <prefix>
scripts/az_cleanup.sh: line 10: jq: command not found
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
BrokenPipeError: [Errno 32] Broken pipe
No groups found
#65 only captures GCP and Azure credentials.
It should also ask for the location of the GCP credential JSON file.