skyplane-project / skyplane

🔥 Blazing fast bulk data transfers between any cloud 🔥

Home Page: https://skyplane.org

License: Apache License 2.0

Python 99.09% Dockerfile 0.29% Shell 0.39% Nix 0.22%
aws azure cloud cloud-computing data-antigravity data-transfer datasync gcp multi-cloud multicloud replication rsync sky skyplane

skyplane's People

Contributors

abiswal2001, antonzabreyko, ethanmehta, haileyjang, jasonding0401, killerdbob, lynnliu030, mech-a, milesturin, parasj, phi-line, s3kim2018, samkumar, sangjun-kang, sarahwooders, shadaj, shishirpatil, simon-mo, troycarlson, vdm, xtram1, xutingl, zizhong


skyplane's Issues

Azure resource groups

Hey Sam, is there a way we can deal with Azure resource groups in a better fashion?

  1. Attempting to tear down all resource groups is problematic for the following reasons:
    a) Some resource groups may contain other classes of resources, so tearing them down could adversely affect unrelated resources.
    b) The teardown doesn't actually happen (see below).

Maybe we should just retain the resource groups and reuse them? Or even create a single resource group and use only that one?

04:59:15 [DEBUG] Loaded gcp_project: skylark-shishir, azure_subscription: ab110d95-7b83-4cec-b9dc-400255f3166e
04:59:15 [WARN]  Warning: soft file limit is set to 1024, increasing for process with `sudo prlimit --pid 7325 --nofile=1048576:1048576`
04:59:22 [DEBUG] Cloud SSH key initialization: 7.27s
04:59:22 [WARN]  Instances will remain up and may result in continued cloud billing. Remember to call `skylark deprovision` to deprovision gateways.
04:59:25 [WARN]  Warning: malformed Azure resource group skylark-azure-f658ffe734944cfabcb792d6a9614af5 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:27 [WARN]  Warning: malformed Azure resource group skylark-azure-b96e2902b57a49c786a5893755fac160 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:29 [WARN]  Warning: malformed Azure resource group skylark-azure-271cf3c639914dea9c4a55dad85c2a97 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:30 [WARN]  Warning: malformed Azure resource group skylark-azure-0eaf46df242e42f58e32ae444bcc4e5c found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:32 [WARN]  Warning: malformed Azure resource group skylark-azure-d56a16bda7bc4c6586caf492ee5759b3 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:34 [WARN]  Warning: malformed Azure resource group skylark-azure-15d02bf2734d452da88bdde1f99a1354 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:35 [WARN]  Warning: malformed Azure resource group skylark-azure-83243f6d1d224aaf92c087345d9d4211 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:37 [WARN]  Warning: malformed Azure resource group skylark-azure-d577b6da64e54c8da11e1476ad316e82 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:39 [WARN]  Warning: malformed Azure resource group skylark-azure-4b678bb8d0a44a059b6c8016c3d23a76 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:41 [WARN]  Warning: malformed Azure resource group skylark-azure-0b1b2344bb364684bd644982a98c8681 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:42 [WARN]  Warning: malformed Azure resource group skylark-azure-751b38d4048d4155a3437497f60e2c21 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
04:59:44 [WARN]  Warning: malformed Azure resource group skylark-azure-a1d03fe6d8c1434096161391aa415030 found and ignored. You should go to the Microsoft Azure portal, investigate this manually, and delete any orphaned resources that may be allocated.
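The retain-and-reuse idea could look roughly like this (a minimal sketch; `SHARED_PREFIX` and `pick_resource_group` are hypothetical names, not Skylark's actual API): derive one deterministic resource group per region and reuse it when it already exists, so there is nothing to tear down between runs.

```python
# Hypothetical sketch of "one shared resource group per region": instead of
# creating (and later failing to tear down) a fresh skylark-azure-<uuid>
# group per run, derive a deterministic name and reuse it if present.

SHARED_PREFIX = "skylark-azure"  # assumed prefix, for illustration only


def pick_resource_group(existing_groups, region):
    """Return (group_name, needs_create) for this region.

    If the shared group already exists, reuse it; otherwise the caller
    creates it once and keeps it across runs.
    """
    target = f"{SHARED_PREFIX}-{region}"
    if target in existing_groups:
        return target, False  # reuse: nothing to create or tear down
    return target, True       # create once, then retain


# Example: a second run in the same region reuses the existing group.
name, needs_create = pick_resource_group({"skylark-azure-westus"}, "westus")
```

The actual create-or-reuse call would go through the Azure SDK's resource group API; the point of the sketch is only that a deterministic name removes the need for the teardown sweep that produces the warnings above.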

[bug] Import error when installed via pip rather than via conda

Import error when running `skylark replicate-random` after installing via pip. The current workaround is to use conda instead.

(env) ubuntu@ip-172-31-52-37:~/skylark$ skylark replicate-random aws:ap-northeast-1 aws:eu-central-1 --chunk-size-mb 8 --n-chunks 1024 --num-gateways 1 --num-outgoing-connections 64
Traceback (most recent call last):
  File "/home/ubuntu/skylark/env/bin/skylark", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/home/ubuntu/skylark/env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3252, in <module>
    def _initialize_master_working_set():
  File "/home/ubuntu/skylark/env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3235, in _call_aside
    f(*args, **kwargs)
  File "/home/ubuntu/skylark/env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 3264, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/home/ubuntu/skylark/env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 583, in _build_master
    ws.require(__requires__)
  File "/home/ubuntu/skylark/env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 900, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/home/ubuntu/skylark/env/lib/python3.8/site-packages/pkg_resources/__init__.py", line 786, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'grpcio-status<2.0dev,>=1.33.2; extra == "grpc"' distribution was not found and is required by google-api-core
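One hedged guess at a fix: pip did not resolve google-api-core's `grpc` extra, so pinning the extra and `grpcio-status` explicitly in the package's requirements (version bounds taken from the error above) may avoid the DistributionNotFound. A hypothetical requirements fragment:

```
google-api-core[grpc]
grpcio-status>=1.33.2,<2.0
```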

AWS DataSync CLI: IAM access roles take time to propagate

While creating an IAM role for DataSync, the permissions take time to propagate. We need a way to retry line 84 in skylark/skylark/cli/cli_aws.py.

(base) ubuntu@sky1:~/skylark$ skylark aws cp-datasync imagenet-records-useast1 skylark-test-us-east-1 fake_imagenet
Creating datasync-role
IAM role ARN: arn:aws:iam::693554043898:role/datasync-role
Traceback (most recent call last):
  File "/home/ubuntu/miniconda/bin/skylark", line 33, in <module>
    sys.exit(load_entry_point('skylark', 'console_scripts', 'skylark')())
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/ubuntu/skylark/skylark/cli/cli_aws.py", line 84, in cp_datasync
    src_response = ds_client_src.create_location_s3(
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/botocore/client.py", line 391, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/botocore/client.py", line 719, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.InvalidRequestException: An error occurred (InvalidRequestException) when calling the CreateLocationS3 operation: Unable to assume role. Reason: Access denied when calling sts:AssumeRole; roleName=datasync-role, roleArn=arn:aws:iam::693554043898:role/datasync-role
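A generic retry-with-exponential-backoff helper is one way to handle this (a sketch, not the actual Skylark code): around cli_aws.py line 84 it would wrap the `create_location_s3` call and retry on the `InvalidRequestException` raised while the freshly created IAM role is still propagating.

```python
import time


def retry(fn, *, attempts=8, base_delay=1.0, retry_on=(Exception,)):
    """Call fn(), retrying with exponential backoff on the given exceptions.

    Hypothetical helper: the real code would pass the botocore exception
    class for InvalidRequestException as retry_on and wrap the
    create_location_s3 call in fn.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

IAM propagation usually settles within tens of seconds, so eight attempts with a 1-second base delay covers a few minutes of backoff.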

skylark deprovision

With a large number of instances, the `skylark deprovision` command starts the decommission but does not return; it gets stuck partway through (for more than 30 minutes).

Deprovisioning 30 instances
Deprovisioning (azure:westus):  53%|█████████████████████████▉                       | 16/30 [04:19<03:02, 13.06s/it]

Azure to Azure object store transfer

There seems to be an error in how we check for the presence of object stores. This wasn't happening before.

Command:
./scripts/experiment.sh azure:westus azure:eastus

Trace:

SKYLARK_DOCKER_IMAGE=ghcr.io/parasj/skylark:local-ab2b075eb2dda73a2cbcc9e9434d42b5
03:26:23 [WARN]  Gurobi not installed, using CoinOR instead.
03:26:30 [DEBUG] Solve throughput problem: 6.38s
03:26:30 [DEBUG] Total cost: $0.0202 (egress: $0.0200, instance: $0.0002)
03:26:30 [DEBUG] Total throughput: [25.0] Gbps
03:26:30 [DEBUG] Total runtime: 0.32s
03:26:30 [DEBUG] Instance regions: [azure:eastus=2, azure:westus=2]
03:26:30 [DEBUG] Flow matrix:
03:26:30 [DEBUG]        azure:westus -> azure:eastus: 25.00 Gbps with 128.0 connections, 64.0GB (link capacity = 24.91 Gbps)
/home/ubuntu/skylark/skylark/replicate/solver.py:317: RuntimeWarning: invalid value encountered in double_scalars
  average_egress_conns = [np.ceil(solution.var_conn[i, :].sum() / solution.var_instances_per_region[i]) for i in range(len(regions))]
/home/ubuntu/skylark/skylark/replicate/solver.py:318: RuntimeWarning: invalid value encountered in double_scalars
  average_ingress_conns = [np.ceil(solution.var_conn[:, i].sum() / solution.var_instances_per_region[i]) for i in range(len(regions))]
03:26:30 [WARN]  azure:westus:0:0c -> azure:eastus (partial): 64.0c of 64.0c remaining
03:26:30 [WARN]  azure:westus:1:0c -> azure:eastus: 64.0c of 0c remaining
03:26:30 [WARN]  azure:westus:0:0c -> azure:eastus:0:0c: 64c remaining
03:26:30 [WARN]  azure:westus:1:0c -> azure:eastus:1:0c: 64.0c remaining
03:26:30 [WARN]  Scaling connections by 1.00x
data/plan/azure-westus_azure-eastus_8_fake_imagenet.json
03:26:31 [DEBUG] Loaded gcp_project: skylark-shishir, azure_subscription: ab110d95-7b83-4cec-b9dc-400255f3166e
03:26:31 [WARN]  Warning: soft file limit is set to 1024, increasing for process with `sudo prlimit --pid 23364 --nofile=1048576:1048576`
03:26:33 [INFO]  Searching for orphaned Azure resources...
03:26:38 [INFO]  Done cleaning up orphaned Azure resources
03:26:38 [DEBUG] Cloud SSH key initialization: 6.08s
03:26:38 [WARN]  Instances will remain up and may result in continued cloud billing. Remember to call `skylark deprovision` to deprovision gateways.        
03:26:42 [DEBUG] Provisioning instances and waiting to boot: 0.00s
03:26:42 [DEBUG] Waiting for ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-3e860ac0553e418b91625ae4fd53764e to be ready
03:26:42 [DEBUG] Waiting for ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-30c7ab543cc044758046f366a24a80f5 to be ready
03:26:42 [DEBUG] Waiting for ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9feb4f96d68444c99d02c2a3662008f6 to be ready
03:26:42 [DEBUG] Waiting for ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-4c72ff9e8c304dc993dca5666d03b998 to be ready
03:26:43 [DEBUG] ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9feb4f96d68444c99d02c2a3662008f6 is ready
03:26:43 [DEBUG] ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-4c72ff9e8c304dc993dca5666d03b998 is ready
03:26:43 [DEBUG] ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-3e860ac0553e418b91625ae4fd53764e is ready
03:26:43 [DEBUG] ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-30c7ab543cc044758046f366a24a80f5 is ready
03:26:45 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9feb4f96d68444c99d02c2a3662008f6, host: 20.106.252.73: Installing docker
03:26:45 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-4c72ff9e8c304dc993dca5666d03b998, host: 20.106.249.118: Installing docker
03:26:45 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-3e860ac0553e418b91625ae4fd53764e, host: 104.210.32.190: Installing docker
03:26:45 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-30c7ab543cc044758046f366a24a80f5, host: 104.210.39.87: Installing docker
03:26:46 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-4c72ff9e8c304dc993dca5666d03b998, host: 20.106.249.118: Starting monitoring
03:26:46 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9feb4f96d68444c99d02c2a3662008f6, host: 20.106.252.73: Starting monitoring
03:26:47 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-4c72ff9e8c304dc993dca5666d03b998, host: 20.106.249.118: Pulling docker image
03:26:47 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9feb4f96d68444c99d02c2a3662008f6, host: 20.106.252.73: Pulling docker image
03:26:47 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-30c7ab543cc044758046f366a24a80f5, host: 104.210.39.87: Starting monitoring
03:26:48 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-3e860ac0553e418b91625ae4fd53764e, host: 104.210.32.190: Starting monitoring
03:26:48 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-30c7ab543cc044758046f366a24a80f5, host: 104.210.39.87: Pulling docker image
03:26:49 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-3e860ac0553e418b91625ae4fd53764e, host: 104.210.32.190: Pulling docker image
03:26:49 [ERROR] Failed to read AWS credentials locally No section: 'default'
03:26:49 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-4c72ff9e8c304dc993dca5666d03b998, host: 20.106.249.118: Starting gateway container ghcr.io/parasj/skylark:local-ab2b075eb2dda73a2cbcc9e9434d42b5
03:26:49 [ERROR] Failed to read AWS credentials locally No section: 'default'
03:26:49 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9feb4f96d68444c99d02c2a3662008f6, host: 20.106.252.73: Starting gateway container ghcr.io/parasj/skylark:local-ab2b075eb2dda73a2cbcc9e9434d42b5
03:26:50 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-4c72ff9e8c304dc993dca5666d03b998, host: 20.106.249.118: Gateway started ed486c186ad74bfb94e2b7308f505b5178efb893444481de1c0d908d4b5c4cd5
03:26:50 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9feb4f96d68444c99d02c2a3662008f6, host: 20.106.252.73: Gateway started d4bfccca38054fb09af98769b0bfe5c4961328b49db1d229244fa0f319daba64
03:26:52 [ERROR] Failed to read AWS credentials locally No section: 'default'
03:26:52 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-30c7ab543cc044758046f366a24a80f5, host: 104.210.39.87: Starting gateway container ghcr.io/parasj/skylark:local-ab2b075eb2dda73a2cbcc9e9434d42b5
03:26:52 [ERROR] Failed to read AWS credentials locally No section: 'default'
03:26:52 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-3e860ac0553e418b91625ae4fd53764e, host: 104.210.32.190: Starting gateway container ghcr.io/parasj/skylark:local-ab2b075eb2dda73a2cbcc9e9434d42b5
03:26:53 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-30c7ab543cc044758046f366a24a80f5, host: 104.210.39.87: Gateway started 138dcf9be7f63f411f4080d16d89884df30d7c31718681c8f4aa4fa31557a32f
03:26:53 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-3e860ac0553e418b91625ae4fd53764e, host: 104.210.32.190: Gateway started 849f977b19c1473da01a0a4c4612f37570a99def0a423226014314a6e6a03b59
03:26:54 [DEBUG] Install gateway package on instances: 12.42s
03:26:54 [INFO]  Provisioned ReplicationTopologyGateway(region='azure:westus', instance=0): http://104.210.32.190:8888/container/849f977b19c1
03:26:54 [INFO]  Provisioned ReplicationTopologyGateway(region='azure:westus', instance=1): http://104.210.39.87:8888/container/138dcf9be7f6
03:26:54 [INFO]  Provisioned ReplicationTopologyGateway(region='azure:eastus', instance=0): http://20.106.252.73:8888/container/d4bfccca3805
03:26:54 [INFO]  Provisioned ReplicationTopologyGateway(region='azure:eastus', instance=1): http://20.106.249.118:8888/container/ed486c186ad7
Query object sizes (fake_imagenet/validation-00127-of-00128): 100%|████████████████████████████████████████| 1152/1152 [00:05<00:00, 203.47it/s]
03:27:01 [DEBUG] Building chunk requests: 0.00s
03:27:01 [DEBUG] Sending 1152 chunk requests to 104.210.32.190
03:27:02 [DEBUG] Dispatch chunk requests: 1.12s
03:27:02 [WARN]  ==> Container already exists. Deletion started. Try restarting after a sufficient gap
03:27:02 [WARN]  ==> Alternatively, use a different bucket name with `--bucket-prefix`
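The `RuntimeWarning: invalid value encountered in double_scalars` lines in the trace above come from solver.py dividing a region's total connections by `var_instances_per_region[i]`, which is 0 for regions with no provisioned instances (0/0). A guarded version of that per-region average (a sketch using stdlib `math` rather than numpy) would avoid the warning:

```python
import math


def avg_conns(total_conns, instances):
    """Per-region average connections, guarded against zero instances.

    Regions with no provisioned instances contribute 0 connections
    instead of tripping a 0/0 RuntimeWarning as in solver.py:317-318.
    """
    if instances == 0:
        return 0
    return math.ceil(total_conns / instances)
```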

Race condition

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/thread.py", line 40, in _python_exit
    t.join()
  File "/home/ubuntu/miniconda/lib/python3.8/threading.py", line 1011, in join
    self._wait_for_tstate_lock()
  File "/home/ubuntu/miniconda/lib/python3.8/threading.py", line 1027, in _wait_for_tstate_lock
    elif lock.acquire(block, timeout):

[bug] Azure resource group not found error

Related to #146 I think?

23:59:47 [DEBUG] Provisioning instances and waiting to boot: 215.36s
Traceback (most recent call last):
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/polling/base_polling.py", line 483, in run
    self._poll()
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/polling/base_polling.py", line 527, in _poll
    _raise_if_bad_http_status_and_method(self._pipeline_response.http_response)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/polling/base_polling.py", line 112, in _raise_if_bad_http_status_and_method
    raise BadStatus(
azure.core.polling.base_polling.BadStatus: Invalid return status 404 for 'GET' operation

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda/bin/skylark", line 33, in <module>
    sys.exit(load_entry_point('skylark', 'console_scripts', 'skylark')())
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/ubuntu/skylark/skylark/cli/cli.py", line 245, in replicate_json
    rc.provision_gateways(reuse_gateways)
  File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 148, in provision_gateways
    results = do_parallel(
  File "/home/ubuntu/skylark/skylark/utils/utils.py", line 72, in do_parallel
    args, result = future.result()
  File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/skylark/skylark/utils/utils.py", line 65, in wrapped_fn
    return args, func(args)
  File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 139, in provision_gateway_instance
    server = self.azure.provision_instance(subregion, self.azure_instance_class)
  File "/home/ubuntu/skylark/skylark/compute/azure/azure_cloud_provider.py", line 409, in provision_instance
    nic_result = poller.result()
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/polling/_poller.py", line 255, in result
    self.wait(timeout)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/tracing/decorator.py", line 83, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/polling/_poller.py", line 275, in wait
    raise self._exception # type: ignore
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/polling/_poller.py", line 192, in _start
    self._polling_method.run()
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/azure/core/polling/base_polling.py", line 487, in run
    raise HttpResponseError(
azure.core.exceptions.HttpResponseError: (ResourceNotFound) The Resource 'Microsoft.Network/networkInterfaces/skylark-azure-09d7c8e030714e3faca55446a6170e1f-vm-nic' under resource group 'skylark' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix
Code: ResourceNotFound
Message: The Resource 'Microsoft.Network/networkInterfaces/skylark-azure-09d7c8e030714e3faca55446a6170e1f-vm-nic' under resource group 'skylark' was not found. For more details please go to https://aka.ms/ARMResourceNotFoundFix

Command: ./scripts/experiment.sh aws:us-east-1 azure:westus

Handling containers/buckets deleted while objects are in flight

While object transfers are ongoing, a resource (the source or destination bucket) may get deleted. In that scenario we need to: a) capture and handle the exception gracefully so the replicator is not stalled, and b) inform the user of the stack trace. Given the current architecture, the stack trace could be written to the logs on the gateways but is not printed on the user's terminal.
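A gateway-side wrapper could capture the failure as data instead of letting it stall the worker, so the client can fetch and print the stack trace. A minimal sketch (`run_chunk_op` is a hypothetical helper, not the actual gateway code):

```python
import traceback


def run_chunk_op(op, chunk):
    """Run one chunk operation, converting any exception (e.g. a deleted
    source/destination bucket) into an error record the client can display,
    rather than silently stalling the replicator.
    """
    try:
        return {"chunk": chunk, "status": "ok", "result": op(chunk)}
    except Exception:
        return {"chunk": chunk, "status": "error",
                "traceback": traceback.format_exc()}
```

The client would then poll chunk statuses, surface any `"error"` records on the terminal, and abort or retry the affected chunks.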

Azure gateway is extremely slow

See the run below.

(base) ubuntu@ip-172-31-82-174:~/skylark$ source scripts/pack_docker.sh && python skylark/test/test_replicator_client.py --gateway-docker-image $SKYLARK_DOCKER_IMAGE --n-chunks 100 --chunk-size-mb 4 --num-gateways 1 --src-region azure:eastus --dest-region azure:westus --bucket-prefix test1234567890
Building docker image
[+] Building 2.6s (22/22) FINISHED
 => [internal] load build definition from Dockerfile                                                                              0.0s
 => => transferring dockerfile: 990B                                                                                              0.0s
 => [internal] load .dockerignore                                                                                                 0.0s
 => => transferring context: 434B                                                                                                 0.0s
 => resolve image config for docker.io/docker/dockerfile:1                                                                        0.1s
 => CACHED docker-image://docker.io/docker/dockerfile:1@sha256:42399d4635eddd7a9b8a24be879d2f9a930d0ed040a61324cfdf59ef1357b3b2   0.0s
 => [internal] load .dockerignore                                                                                                 0.0s
 => [internal] load build definition from Dockerfile                                                                              0.0s
 => [internal] load metadata for docker.io/library/python:3.8-slim                                                                0.1s
 => [stage-0  1/13] FROM docker.io/library/python:3.8-slim@sha256:95240f5291de3193c1299c5b2513f9bb99ecdae0fabd137156b0fb8c47afd6  0.0s
 => [internal] load build context                                                                                                 0.1s
 => => transferring context: 7.99MB                                                                                               0.1s
 => CACHED [stage-0  2/13] RUN echo 'net.ipv4.ip_local_port_range = 12000 65535' >> /etc/sysctl.conf                              0.0s
 => CACHED [stage-0  3/13] RUN echo 'fs.file-max = 1048576' >> /etc/sysctl.conf                                                   0.0s
 => CACHED [stage-0  4/13] RUN mkdir -p /etc/security/                                                                            0.0s
 => CACHED [stage-0  5/13] RUN echo '*                soft    nofile          1048576' >> /etc/security/limits.conf               0.0s
 => CACHED [stage-0  6/13] RUN echo '*                hard    nofile          1048576' >> /etc/security/limits.conf               0.0s
 => CACHED [stage-0  7/13] RUN echo 'root             soft    nofile          1048576' >> /etc/security/limits.conf               0.0s
 => CACHED [stage-0  8/13] RUN echo 'root             hard    nofile          1048576' >> /etc/security/limits.conf               0.0s
 => CACHED [stage-0  9/13] COPY scripts/requirements-gateway.txt /tmp/requirements-gateway.txt                                    0.0s
 => CACHED [stage-0 10/13] RUN --mount=type=cache,target=/root/.cache/pip pip install --no-cache-dir --compile -r /tmp/requireme  0.0s
 => CACHED [stage-0 11/13] WORKDIR /pkg                                                                                           0.0s
 => [stage-0 12/13] COPY . .                                                                                                      0.1s
 => [stage-0 13/13] RUN pip install -e .                                                                                          2.0s
 => exporting to image                                                                                                            0.1s
 => => exporting layers                                                                                                           0.1s
 => => writing image sha256:4c751e3f4bc3572597df2bd2a147e72aebbc3199c46eb7bffed15f23a45e0604                                      0.0s
 => => naming to docker.io/library/skylark                                                                                        0.0s
Uploading docker image to ghcr.io/parasj/skylark:local-0ac2daa948e124df4ab7aa4d5c445816
The push refers to repository [ghcr.io/parasj/skylark]
fb0a04656026: Pushed
2426c1884d13: Pushed
d5089b43742f: Layer already exists
b485462ae282: Layer already exists
d65daa64516b: Layer already exists
58dfc2b4e272: Layer already exists
8d558f297057: Layer already exists
264394ca8d1f: Layer already exists
42a596f0a6b9: Layer already exists
e3528ada1b37: Layer already exists
fe022393ef8d: Layer already exists
e4ab298cd14a: Layer already exists
ec8e20bb6d54: Layer already exists
51f094ff7b94: Layer already exists
1a40cb2669f8: Layer already exists
32034715e5d4: Layer already exists
7d0ebbe3f5d2: Layer already exists
local-0ac2daa948e124df4ab7aa4d5c445816: digest: sha256:0778f6c5df5cb3fa9134013eaf1ead2b003394569a18f2ea188f6b3b71d4d1e8 size: 3864
Deleted build cache objects:
e5pq85w786mj42tdk56afje4l
cmquw1liqm37jk29mw9bh3jnq
wgvaunwqmeluwarhmr2ubh1n0

Total reclaimed space: 7.971MB
SKYLARK_DOCKER_IMAGE=ghcr.io/parasj/skylark:local-0ac2daa948e124df4ab7aa4d5c445816

=================================================
  ______  _             _                 _
 / _____)| |           | |               | |
( (____  | |  _  _   _ | |  _____   ____ | |  _
 \____ \ | |_/ )| | | || | (____ | / ___)| |_/ )
 _____) )|  _ ( | |_| || | / ___ || |    |  _ (
(______/ |_| \_) \__  | \_)\_____||_|    |_| \_)
                (____/
=================================================

09:03:56 [INFO]  Not skipping upload, source bucket is test1234567890-skylark-eastus, destination bucket is test1234567890-skylark-westus
09:03:56 [INFO]  Creating test objects
09:03:57 [INFO]  Uploading 100 to bucket test1234567890-skylark-eastus
09:03:57 [INFO]  Creating replication client
09:03:57 [DEBUG] Loaded gcp_project: skylark-333700, azure_subscription: ab110d95-7b83-4cec-b9dc-400255f3166e
09:04:05 [DEBUG] Cloud SSH key initialization: 7.58s
09:04:05 [INFO]  Provisioning gateway instances
09:04:08 [DEBUG] Provisioning instances and waiting to boot: 0.00s
09:04:09 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-4e90f72dd70146ab931f76fbbf7d2121, host: 23.101.204.63: Installing docker
09:04:09 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9cc892cf8ddf47d5bf8f495602674bfc, host: 20.115.111.168: Installing docker
09:04:11 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9cc892cf8ddf47d5bf8f495602674bfc, host: 20.115.111.168: Starting monitoring
09:04:11 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-4e90f72dd70146ab931f76fbbf7d2121, host: 23.101.204.63: Starting monitoring
09:04:13 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9cc892cf8ddf47d5bf8f495602674bfc, host: 20.115.111.168: Pulling docker image
09:04:13 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-4e90f72dd70146ab931f76fbbf7d2121, host: 23.101.204.63: Pulling docker image
09:04:15 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9cc892cf8ddf47d5bf8f495602674bfc, host: 20.115.111.168: Starting gateway container ghcr.io/parasj/skylark:local-0ac2daa948e124df4ab7aa4d5c445816
09:04:16 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-9cc892cf8ddf47d5bf8f495602674bfc, host: 20.115.111.168: Gateway started e1dfbaaa45ca7e22785516b162cefcb9c736419d053702c75b75b201cacf0f8d
09:04:20 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-4e90f72dd70146ab931f76fbbf7d2121, host: 23.101.204.63: Starting gateway container ghcr.io/parasj/skylark:local-0ac2daa948e124df4ab7aa4d5c445816
09:04:24 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:westus:skylark-azure-4e90f72dd70146ab931f76fbbf7d2121, host: 23.101.204.63: Gateway started 103a36fada2088adf5e66d505f0d09cf41cd31cbbfb2dd5a33e2f67beb5e967b
09:04:24 [DEBUG] Install gateway package on instances: 16.25s
09:04:24 [INFO]  Provisioned ReplicationTopologyGateway(region='azure:westus', instance=0): http://23.101.204.63:8888/container/103a36fada20
09:04:24 [INFO]  Provisioned ReplicationTopologyGateway(region='azure:eastus', instance=0): http://20.115.111.168:8888/container/e1dfbaaa45ca
Query object sizes (/test/direct_replication/99): 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 100/100 [00:00<00:00, 427.80it/s]
09:04:24 [DEBUG] Building chunk requests: 0.00s
09:04:24 [DEBUG] Sending 100 chunk requests to 20.115.111.168
09:04:25 [DEBUG] Dispatch chunk requests: 0.32s
09:04:25 [INFO]  0.39GByte replication job launched
Replication: average 0.25Gbit/s: 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 3.12G/3.12G [00:13<00:00, 251Mbit/s]
09:04:38 [INFO]  Replication completed in 12.68s (0.25Gbit/s)

Valid cloud region

When the assertion fails, we need a better mechanism to deal with it. Right now, it just continues.

Traceback (most recent call last):
  File "scripts/setup_bucket.py", line 94, in <module>
    main(parse_args())
  File "scripts/setup_bucket.py", line 50, in main
    obj_store_interface_src = ObjectStoreInterface.create(args.src_region, src_bucket)
  File "/home/ubuntu/skylark/skylark/obj_store/object_store_interface.py", line 53, in create
    return AzureInterface(region_tag.split(":")[1], bucket)
  File "/home/ubuntu/skylark/skylark/obj_store/azure_interface.py", line 23, in __init__
    assert self.azure_region in azure_storage_credentials
AssertionError
Building docker image
[+] Building 3.1s (19/19) FINISHED
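
A defensive rewrite of the failing check could raise a descriptive error instead of a bare assert, so callers can catch it rather than continuing past the failure. This is a hypothetical sketch, not the repo's actual code; `azure_storage_credentials` is assumed to be a dict keyed by Azure region name:

```python
# Hypothetical sketch: replace the bare `assert` in AzureInterface.__init__
# with a descriptive exception so callers can handle the failure explicitly.
# The real credential structure may differ.
azure_storage_credentials = {"eastus": "<key>", "westus": "<key>"}  # assumed shape

def validate_azure_region(region: str) -> str:
    if region not in azure_storage_credentials:
        raise ValueError(
            f"No Azure storage credentials configured for region '{region}'; "
            f"known regions: {sorted(azure_storage_credentials)}"
        )
    return region
```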

[bug] prlimit called from Mac OS

OS: Mac OS 10.15.7

Command: skylark replicate-random aws:us-east-1 aws:us-west-1

Error trace:

sudo: prlimit: command not found
Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.9/bin/skylark", line 33, in <module>
    sys.exit(load_entry_point('skylark', 'console_scripts', 'skylark')())
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/typer/main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "/Users/asimbiswal/Desktop/Cal/RISELab/skylark/skylark/skylark/cli/cli.py", line 134, in replicate_random
    check_ulimit()
  File "/Users/asimbiswal/Desktop/Cal/RISELab/skylark/skylark/skylark/cli/cli_helper.py", line 241, in check_ulimit
    subprocess.check_output(increase_soft_limit)
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/subprocess.py", line 528, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['sudo', 'prlimit', '--pid', '1581', '--nofile=1048576:1048576']' returned non-zero exit status 1.
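
`prlimit` is Linux-only, which is why shelling out to `sudo prlimit` fails on macOS. A portable alternative is to raise the soft file-descriptor limit in-process via the standard-library `resource` module (this only affects the current process and its children, not an arbitrary pid, so it is a sketch of one possible fix rather than a drop-in replacement):

```python
import resource

def raise_nofile_limit(target: int = 1048576) -> int:
    """Raise the RLIMIT_NOFILE soft limit toward `target` for the current
    process. Unlike `prlimit` (Linux-only), resource.setrlimit works on
    both Linux and macOS."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Cap at the hard limit; RLIM_INFINITY means there is no cap.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if new_soft > soft:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```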

[bug] Debug consistent appearance of stragglers

See the following output from a gateway log when running the basic test in the readme:

today at 8:06:39 PMProcess Process-55:
today at 8:06:39 PMTraceback (most recent call last):
today at 8:06:39 PM  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
today at 8:06:39 PM    self.run()
today at 8:06:39 PM  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
today at 8:06:39 PM    self._target(*self._args, **self._kwargs)
today at 8:06:39 PM  File "/pkg/skylark/gateway/gateway_sender.py", line 60, in worker_loop
today at 8:06:39 PM    self.send_chunks([next_chunk_id], dest_ip)
today at 8:06:39 PM  File "/pkg/skylark/gateway/gateway_sender.py", line 128, in send_chunks
today at 8:06:39 PM    sock.sendall(chunk_data)
today at 8:06:39 PMConnectionResetError: [Errno 104] Connection reset by peer

TCP connections should be persistent (not closed after each batch)

Based on my reading of the code, it looks like the connection isn't persistent --- the destination port is left listening, but a separate connection is still opened for each batch of chunks.

I'm approving since it improves bandwidth, but I still think that keeping the TCP connection itself alive for a long time is good, to avoid ramping up congestion control each time.

Originally posted by @samkumar in #38 (review)
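
The suggested fix, caching one long-lived TCP connection per destination instead of reconnecting per batch, could be sketched like this (class and method names are illustrative, not Skylark's actual gateway API):

```python
import socket

class PersistentSender:
    """Sketch: reuse one TCP connection per destination across batches,
    avoiding a fresh congestion-control ramp-up (slow start) each time."""

    def __init__(self):
        self._conns = {}  # (host, port) -> connected socket

    def _conn(self, addr):
        sock = self._conns.get(addr)
        if sock is None:
            sock = socket.create_connection(addr)
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
            self._conns[addr] = sock
        return sock

    def send_batch(self, addr, chunks):
        sock = self._conn(addr)  # reused across calls, not reopened
        for chunk in chunks:
            sock.sendall(chunk)

    def close(self):
        for sock in self._conns.values():
            sock.close()
        self._conns.clear()
```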

[meta] Tracking cloud provider permissioning

  • Common issues:
    • Need a better configuration engine stored in ~/.skylark/config.json
    • Generated AWS keys should be stored in ~/.ssh/skylark
    • SSH keys should be signed with a random hash stored in the config and all created instances should be tagged with that hash so multiple users can share a single AWS account
    • Tag all instances when they are in use and untag them after skylark runs
    • Automatically have gateways shut down. Upon provisioning, configure instances to terminate upon shutdown (instance-initiated-shutdown-behavior terminate).
    • skylark init should support disabling specific cloud providers if users do not have access
  • AWS:
    • Skylark should use its own IAM rather than using the user's
    • On skylark init, query for opt-in regions enabled and then disable them in AWS. No explicit list should be maintained.
    • Skylark should make a skylark VPC instead of using the default VPC
    • Skylark should use ECS-optimized container images (fixed by #152)
  • GCP:
    • Skylark should use Container OS optimized images (fixed by #153)
  • Azure:
    • config.json bugs
    • Use single skylark resource group instead of making a new resource group per VM #125 #146
    • Can we reuse more resources if possible rather than recreating them?
    • Azure storage permissions should automatically load instead of using pre-signed URLs #109
    • Test updated Azure interface to debug whether it solves resource group reliability issues

[bug] Provisioning instance type in unsupported availability zone

Triggered by running skylark replicate-random on main branch:

09:19:10 [DEBUG] Loaded gcp_project: None, azure_subscription: None
09:19:10 [WARN]  Warning: soft file limit is set to 1024, increasing for process with `sudo prlimit --pid 34921 --nofile=1048576:1048576`
09:19:17 [DEBUG] Cloud SSH key initialization: 6.51s
09:19:17 [WARN]  Instances will remain up and may result in continued cloud billing. Remember to call `skylark deprovision` to deprovision gateways.
09:19:21 [WARN]  Retrying start_instance due to An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (m5.8xlarge) is not supported in your requested Availability Zone (us-east-1e). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f. (attempt 1/4)
09:19:23 [WARN]  Retrying start_instance due to An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (m5.8xlarge) is not supported in your requested Availability Zone (us-east-1e). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f. (attempt 2/4)
09:19:27 [WARN]  Retrying start_instance due to An error occurred (Unsupported) when calling the RunInstances operation: Your requested instance type (m5.8xlarge) is not supported in your requested Availability Zone (us-east-1e). Please retry your request by not specifying an Availability Zone or choosing us-east-1a, us-east-1b, us-east-1c, us-east-1d, us-east-1f. (attempt 3/4)
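
As the AWS error message itself suggests, retrying the same zone will never succeed; the retry loop should move on to another zone or let the provider pick one. A provider-agnostic sketch of that fallback (the `launch` callable and error text are assumptions standing in for the boto3 `RunInstances` call):

```python
def launch_with_az_fallback(launch, instance_type, azs):
    """Sketch: try each availability zone in turn, then fall back to letting
    the provider choose (az=None). `launch` is an assumed callable that
    raises RuntimeError containing "Unsupported" when the instance type is
    unavailable in the given zone."""
    last_err = None
    for az in [*azs, None]:  # None = no explicit zone
        try:
            return launch(instance_type, az)
        except RuntimeError as e:
            if "Unsupported" not in str(e):
                raise  # unrelated failure: do not mask it
            last_err = e
    raise last_err
```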

[meta] Centralized logging of user transfers

We can gain a lot of insights about the kind of data users are moving by logging their workloads to a central analytics store. We should log things like:

  • size of transfer
  • file sizes
  • file extensions
  • regions transferred
  • achieved transfer speed

Status:

  • #180
  • Log transfer stats
  • Log files copied
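
The analytics record described above could be a flat, JSON-serializable dict, assembled once per transfer. The field names here are assumptions for illustration, not an existing schema:

```python
import time

def transfer_log_record(size_bytes, file_sizes, extensions, src_region, dst_region, runtime_s):
    """Sketch of a per-transfer analytics record (field names assumed)."""
    gbits = size_bytes * 8 / 1e9
    return {
        "timestamp": time.time(),
        "transfer_size_bytes": size_bytes,
        "num_files": len(file_sizes),
        "file_extensions": sorted(set(extensions)),  # no filenames, just types
        "src_region": src_region,
        "dst_region": dst_region,
        "throughput_gbits": gbits / runtime_s if runtime_s else None,
    }
```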

GCP obj store write

Command: ./scripts/experiment.sh aws:us-east-1 gcp:us-east1-b

00:12:31 [INFO]  Provisioned ReplicationTopologyGateway(region='aws:us-east-1', instance=5): http://3.81.207.255:8888/container/34bae3e826c6
Query object sizes (fake_imagenet/validation-00127-of-00128): 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 1002/1002 [00:14<00:00, 67.88it/s]
00:12:46 [DEBUG] Building chunk requests: 0.00s
00:12:46 [DEBUG] Sending 154 chunk requests to 35.168.114.149
00:12:46 [DEBUG] Sending 154 chunk requests to 3.84.207.52
00:12:46 [DEBUG] Sending 154 chunk requests to 54.210.76.125
00:12:46 [DEBUG] Sending 154 chunk requests to 54.162.89.171
00:12:46 [DEBUG] Sending 386 chunk requests to 174.129.108.137
00:12:46 [DEBUG] Dispatch chunk requests: 0.17s
00:12:46 [INFO]  60.55GByte replication job launched
Replication: average 0.00Gbit/s:   0%|                                                                                         | 0.00/484G [00:20<?, ?bit/s]
00:13:06 [ERROR] No chunks completed after 20s! There is probably a bug, check logs. Exiting...
Replication: average 0.00Gbit/s:   0%|                                                                                         | 0.00/484G [00:20<?, ?bit/s]
{"total_runtime_s": 20.221991, "throughput_gbits": 0.0, "monitor_status": "timed_out", "success": false}
aws-us-east-1_gcp-us-east1-b_8_fake_imagenet

AWS bucket already exists

We should either delete the bucket after each experiment or find a way to reuse it. This is likely happening because S3 bucket names must be globally unique across all AWS accounts, so a fixed name can collide with someone else's bucket.

(base) ubuntu@sky1:~/skylark/scripts$ ./experiment.sh aws:us-east-1 aws:us-west-1
experiments-skylark-us-east-1
experiments-skylark-us-west-1
data/plan/aws-us-east-1_aws-us-west-1_8_fake_imagenet.json
experiments-skylark-us-east-1
experiments-skylark-us-west-1
Traceback (most recent call last):
  File "setup_bucket.py", line 96, in <module>
    main(parse_args())
  File "setup_bucket.py", line 53, in main
    obj_store_interface_src.create_bucket()
  File "/home/ubuntu/skylark/skylark/obj_store/s3_interface.py", line 76, in create_bucket
    s3_client.create_bucket(Bucket=self.bucket_name)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/botocore/client.py", line 391, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/botocore/client.py", line 719, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.errorfactory.BucketAlreadyExists: An error occurred (BucketAlreadyExists) when calling the CreateBucket operation: The requested bucket name is not available. The bucket namespace is shared by all users of the system. Please select a different name and try again.
./experiment.sh: line 41: scripts/pack_docker.sh: No such file or directory
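
One simple way to avoid the collision is to append a random suffix to the bucket name rather than reusing a fixed one. A sketch (the prefix must stay short enough to respect S3's 63-character bucket name limit):

```python
import uuid

def unique_bucket_name(prefix: str) -> str:
    """Sketch: S3 bucket names are globally unique across all AWS accounts,
    so a fixed name like 'experiments-skylark-us-east-1' can collide with a
    bucket someone else owns. A random suffix sidesteps BucketAlreadyExists."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"
```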

Skylark deprovision

Traceback (most recent call last):
  File "/home/ubuntu/miniconda/bin/skylark", line 33, in <module>
    sys.exit(load_entry_point('skylark', 'console_scripts', 'skylark')())
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/ubuntu/skylark/skylark/cli/cli.py", line 282, in deprovision
    deprovision_skylark_instances(azure_subscription=azure_subscription, gcp_project_id=gcp_project)
  File "/home/ubuntu/skylark/skylark/cli/cli_helper.py", line 267, in deprovision_skylark_instances
    instances += gcp.get_matching_instances()
  File "/home/ubuntu/skylark/skylark/compute/gcp/gcp_cloud_provider.py", line 164, in get_matching_instances
    instances: List[Server] = super().get_matching_instances(**kwargs)
  File "/home/ubuntu/skylark/skylark/compute/cloud_providers.py", line 60, in get_matching_instances
    if not all(instance.tags().get(k, "") == v for k, v in tags.items()):
  File "/home/ubuntu/skylark/skylark/compute/cloud_providers.py", line 60, in <genexpr>
    if not all(instance.tags().get(k, "") == v for k, v in tags.items()):
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/cachetools/__init__.py", line 520, in wrapper
    v = func(*args, **kwargs)
  File "/home/ubuntu/skylark/skylark/compute/gcp/gcp_server.py", line 84, in tags
    return self.get_instance_property("labels")
  File "/home/ubuntu/skylark/skylark/compute/gcp/gcp_server.py", line 60, in get_instance_property
    return self.get_gcp_instance()[prop]
KeyError: 'labels'
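
The GCP API omits the `labels` key entirely for instances that have no labels, so indexing the instance dict raises KeyError. A sketch of the fix using `.get` with a default (function shapes inferred from the traceback; the real `gcp_server.py` methods take `self`):

```python
def get_instance_property(instance: dict, prop: str, default=None):
    # Return the property if present; GCP drops absent keys rather than
    # returning empty values, so a default is needed.
    return instance.get(prop, default)

def tags(instance: dict) -> dict:
    # Unlabeled instances simply have no tags.
    return get_instance_property(instance, "labels", {})
```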

File transfer

With the latest commit, even for 10 chuncks of 1 MB each for us-east<->us-west transfer the first 9 chunks are relayed in ~2 seconds. For the last chunk relay does not complete in one case after even 42 mins.

To recreate:
python skylark/test/test_replicator_client.py --gateway-docker-image $SKYLARK_DOCKER_IMAGE --skip-upload --n-chunks 10 --chunk-size-mb 1 --num-gateways 1 --src-region aws:us-east-1 --dest-region aws:us-west-1 --gcp-instance-class None

#35 @parasj

Credentials from file

On the remote gateway, the configs are not being returned. For example, this line throws a key-not-found error.

Credential and Authentication

As of now, the keys are stored in multiple locations. Most are in data/config.json, but some live elsewhere (e.g., the GCP key, Azure blob keys). Further, each sub-module tries to authorize every time it runs. We need to a) consolidate all the keys, b) have a uniform way of reading them (input parameter vs. environment variable vs. something else), and c) eliminate code duplication by having a single module that handles authentication.

This warrants some discussion, given we have seen some techniques of authentication to be buggy previously.
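
A minimal sketch of point (c), a single consolidated config loaded once from `~/.skylark/config.json`; the field names here are assumptions, not an agreed schema:

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional

@dataclass
class SkylarkConfig:
    """Sketch of one consolidated credential/config object (field names
    assumed), loaded once instead of each sub-module re-authorizing."""
    gcp_project_id: Optional[str] = None
    azure_subscription_id: Optional[str] = None
    aws_access_key_id: Optional[str] = None

    @classmethod
    def load(cls, path: Path) -> "SkylarkConfig":
        if path.exists():
            return cls(**json.loads(path.read_text()))
        return cls()  # all-None defaults when no config exists yet
```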

[Solved] Azure Storage Blob access exception

If you are unable to access the blob (are able to create a container, but unable to add blobs to the container) add the following two permissions : Storage Blob Data Contributor and Storage Queue Data Contributor to the skylark Azure app through the Access Control (IAM) page on the Container's webpage.
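
The same role assignments can be made from the Azure CLI; this is a sketch with placeholders (app ID, subscription, resource group, and storage account must be filled in), not the exact command the repo uses:

```shell
# Grant the skylark Azure app blob and queue data access on a storage account.
# <app-id>, <subscription-id>, <resource-group>, <storage-account> are placeholders.
az role assignment create \
  --assignee "<app-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"

az role assignment create \
  --assignee "<app-id>" \
  --role "Storage Queue Data Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Storage/storageAccounts/<storage-account>"
```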

Exception:

Azure Blob Storage v12.9.0
Creating container:43b77945-de57-4980-a082-fb21979e81eb
Uploading to Azure Storage as blob:
        test.txt
Exception:
This request is not authorized to perform this operation using this permission.
RequestId:509f5a35-d01e-001c-6304-2929af000000
Time:2022-02-23T22:30:39.3449575Z
ErrorCode:AuthorizationPermissionMismatch

Issue with running skylark on fresh instance

$ skylark
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 584, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 901, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 792, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (Click 7.0 (/usr/lib/python3/dist-packages), Requirement.parse('click>=7.1.2'), {'flask'})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/.local/bin/skylark", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 3254, in <module>
    def _initialize_master_working_set():
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 3237, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 3266, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 586, in _build_master
    return cls._build_from_requirements(__requires__)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 599, in _build_from_requirements
    dists = ws.resolve(reqs, Environment())
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 787, in resolve
    raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'click>=7.1.2' distribution was not found and is required by flask

GCP Firewall issue

Trace

Traceback (most recent call last):
  File "skylark/test/test_replicator_client.py", line 130, in <module>
    main(parse_args())
  File "skylark/test/test_replicator_client.py", line 89, in main
    rc = ReplicatorClient(
  File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 60, in __init__
    do_parallel(lambda fn: fn(), jobs)
  File "/home/ubuntu/skylark/skylark/utils/utils.py", line 67, in do_parallel
    args, result = future.result()
  File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/skylark/skylark/utils/utils.py", line 60, in wrapped_fn
    return args, func(args)
  File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 60, in <lambda>
    do_parallel(lambda fn: fn(), jobs)
  File "/home/ubuntu/skylark/skylark/compute/gcp/gcp_cloud_provider.py", line 167, in configure_default_firewall
   op = compute.firewalls().insert(project=self.gcp_project, body=fw_body).execute()
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/googleapiclient/_helpers.py", line 131, in positional
_wrapper
    return wrapped(*args, **kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/googleapiclient/http.py", line 937, in execute       
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 404 when requesting https://compute.googleapis.com/compute/v1/projects/skylark-shishir/global/firewalls?alt=json returned "The resource 'projects/skylark-shishir/global/networks/default' was not found". Details: "[{'message': "The resource 'projects/skylark-shishir/global/networks/default' was not found", 'domain': 'global', 'reason': 'notFound'}]">

To recreate:

python skylark/test/test_replicator_client.py --gateway-docker-image $SKYLARK_DOCKER_IMAGE --skip-upload --n-chunks 512 --chunk-size-mb 10 --num-gateways 1 --src-region aws:us-east-1 --dest-region aws:us-west-1 --gcp-instance-class n2 --gcp-project skylark-shishir

Azure gateway spin-ups

test_replicator_client.py has now been modified to read the Azure subscription from an OS environment variable, similar to the CLI. The subscription can be overridden by passing --azure-subscription when invoking test_replicator. So, to fall back to the environment variable, we omit the flag, and args.azure_subscription defaults to None. However, this raises the following issue.

2022-01-28 02:54:38.316 | INFO | __main__:main:178 - Provisioning gateway instances
Traceback (most recent call last):
  File "skylark/test/test_replicator_client.py", line 206, in <module>
    main(parse_args())
  File "skylark/test/test_replicator_client.py", line 179, in main
    rc.provision_gateways(
  File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 76, in provision_gateways
    assert len(azure_regions_to_provision) == 0, "Azure not enabled"
AssertionError: Azure not enabled

Command to replicate:
python skylark/test/test_replicator_client.py --gateway-docker-image $SKYLARK_DOCKER_IMAGE --n-chunks 100 --chunk-size-mb 4 --num-gateways 1 --src-region azure:eastus --dest-region azure:westus --gcp-project skylark-shishir --bucket-prefix t11

Commit ID: f32b4db of #100

Workaround for now: just include --azure-subscription

[meta] Clear error reporting + logging

At the moment, errors are consumed silently and users have to SSH into an instance or use the log viewer to query the state of the gateway.

Upon exiting, logs from the gateway should be copied back to the replicator client to /tmp alongside the profile and saved logs from the replicator client. We should save all state including the command they ran. This will make debugging much easier since users can send us these logs via a github issue.

Linked issue: #163

Azure timeout error

Command: skylark replicate-random aws:us-east-1 azure:eastus -n 10 -s $((1024 * 68)) --chunk-size-mb 68

Docker: SKYLARK_DOCKER_IMAGE=ghcr.io/parasj/skylark:local-a3059934af58de4641a93cc87cab0219

02:56:47 [DEBUG] Starting gateway ab110d95-7b83-4cec-b9dc-400255f3166e:azure:eastus:skylark-azure-0e5e1e80daff456f828285adc72f0e6b, host: 52.142.46.50: Gateway started b6778e3a8693c9520d354c1e1da621621a38da646744385df8466c5f91ed3ba7
02:56:48 [DEBUG] Starting gateway aws:us-east-1:i-0e9e40d93ee201d79, host: 52.55.133.29: Gateway started 0c1c5df09e1d913b78a760e7e754b10df63e5c1fe3bfa9487be435e35bd5b735
02:58:54 [DEBUG] Install gateway package on instances: 138.89s
Traceback (most recent call last):
  File "/home/ubuntu/miniconda/bin/skylark", line 33, in <module>
    sys.exit(load_entry_point('skylark', 'console_scripts', 'skylark')())
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/typer/main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/ubuntu/skylark/skylark/cli/cli.py", line 162, in replicate_random
    rc.provision_gateways(reuse_gateways)
  File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 183, in provision_gateways
    do_parallel(lambda arg: arg[0].start_gateway(arg[1], gateway_docker_image=self.gateway_docker_image), args, n=-1)
  File "/home/ubuntu/skylark/skylark/utils/utils.py", line 68, in do_parallel
    args, result = future.result()
  File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/_base.py", line 437, in result
    return self.__get_result()
  File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/_base.py", line 389, in __get_result
    raise self._exception
  File "/home/ubuntu/miniconda/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/skylark/skylark/utils/utils.py", line 61, in wrapped_fn
    return args, func(args)
  File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 183, in <lambda>
    do_parallel(lambda arg: arg[0].start_gateway(arg[1], gateway_docker_image=self.gateway_docker_image), args, n=-1)
  File "/home/ubuntu/skylark/skylark/compute/server.py", line 203, in start_gateway
    check_stderr(self.run_command(make_sysctl_tcp_tuning_command(cc="bbr" if use_bbr else "cubic")))
  File "/home/ubuntu/skylark/skylark/compute/server.py", line 174, in run_command
    client = self.ssh_client
  File "/home/ubuntu/skylark/skylark/compute/server.py", line 101, in ssh_client
    self.client = self.get_ssh_client_impl()
  File "/home/ubuntu/skylark/skylark/compute/azure/azure_server.py", line 151, in get_ssh_client_impl
    ssh_client.connect(
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/paramiko/client.py", line 349, in connect
    retry_on_signal(lambda: sock.connect(addr))
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/paramiko/util.py", line 283, in retry_on_signal
    return function()
  File "/home/ubuntu/miniconda/lib/python3.8/site-packages/paramiko/client.py", line 349, in <lambda>
    retry_on_signal(lambda: sock.connect(addr))
TimeoutError: [Errno 110] Connection timed out

Standardize on Gibibytes, not Gigabytes

At the moment, all storage is reported in gigabytes (10^9 bytes) rather than gibibytes (2^30 bytes). AWS, GCP, and Azure all measure storage in GiB (1024 * 1024 * 1024 bytes), even when the label says "GB".
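
Making the unit explicit in code avoids the ambiguity entirely; a minimal sketch of the two constants and converters:

```python
# Decimal gigabyte vs. binary gibibyte: a ~7.4% difference at this scale.
GB = 10**9    # gigabyte
GiB = 2**30   # gibibyte = 1,073,741,824 bytes

def bytes_to_gib(n: int) -> float:
    return n / GiB

def bytes_to_gb(n: int) -> float:
    return n / GB
```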

[bug] `replicate-json` broken by some plans: AssertionError: 2 batches, expected 3

replicate-json broken by some plans

Query object sizes (fake_imagenet/validation-00127-of-00128): 100%|โ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆโ–ˆ| 1152/1152 [00:14<00:00, 81.29it/s]
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/bin/skylark", line 33, in <module>
    sys.exit(load_entry_point('skylark', 'console_scripts', 'skylark')())
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/typer/main.py", line 214, in __call__
    return get_command(self)(*args, **kwargs)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/ubuntu/miniconda3/lib/python3.8/site-packages/typer/main.py", line 500, in wrapper
    return callback(**use_params)  # type: ignore
  File "/home/ubuntu/skylark/skylark/cli/cli.py", line 277, in replicate_json
    job = rc.run_replication_plan(job)
  File "/home/ubuntu/skylark/skylark/replicate/replicator_client.py", line 242, in run_replication_plan
    assert len(chunk_batches) == len(src_instances), f"{len(chunk_batches)} batches, expected {len(src_instances)}"
AssertionError: 2 batches, expected 3
sarah-skylark-europe-north1-a fake_imagenet/train-00000-of-01024

Command:

# create plan
skylark solver solve-throughput gcp:europe-north1-a gcp:us-west4-a 16 -o plan.json --max-instances 16

# run replication (obj store)
skylark replicate-json $(unknown) \
    --gcp-project skylark-sarah \
    --source-bucket $src_bucket \
    --dest-bucket $dest_bucket \
    --key-prefix fake_imagenet > data/results/${experiment}/obj-store-logs.txt

AZ cleanup

(base) ubuntu@sky1:~/skylark$ bash scripts/az_cleanup.sh
If you want to filter this list to a subset, pass an argument as a prefix: az_cleanup.sh <prefix>
scripts/az_cleanup.sh: line 10: jq: command not found
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
BrokenPipeError: [Errno 32] Broken pipe
No groups found
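
The script could fail fast with a clear message when a required tool is missing, instead of surfacing `jq: command not found` mid-pipeline. A sketch of a guard that az_cleanup.sh could call at the top:

```shell
# Sketch: verify a required command exists before the script depends on it.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "error: '$1' is required but not installed" >&2
    return 1
  }
}

# Usage at the top of az_cleanup.sh:
#   require_cmd jq || exit 1
#   require_cmd az || exit 1
```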
