
jmgaljaard / fltk-testbed

License: BSD 2-Clause "Simplified" License

Dockerfile 0.19% Python 77.35% Jupyter Notebook 16.00% HCL 5.39% Smarty 1.08%

fltk-testbed's People

Contributors

bacox, fastjur, izaakc, jmgaljaard, kponichtera, wrzonneveld


fltk-testbed's Issues

Experiment Replication

Feature Request

Currently the deployment is fire-and-forget, without any replication possibility. A nice feature would be to make it possible to define an experiment configuration one time. Then allow the orchestrator to run multiple times with different seeds, to allow for easy replication of different experiment types.

  • SimulatedArrivalGenerator --> The seed is used to pseudo-randomly create the workload; all further random draws, e.g. inter-arrival times and the seeds of the jobs that are deployed, are derived from this seed.
  • SequentialArrivalGenerator --> The seed is used within the deployment, to ensure that results are obtained in a repeatable fashion.
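A minimal sketch of the seeding behaviour described above. The class name matches the codebase, but the constructor, method, and arrival-rate parameter are illustrative assumptions, not the actual API:

```python
import random


class SimulatedArrivalGenerator:
    """Sketch: every random draw goes through a single generator seeded
    once, so an experiment configuration can be replayed exactly."""

    def __init__(self, seed: int, mean_inter_arrival: float = 10.0):
        self.rng = random.Random(seed)       # one seeded source of randomness
        self.mean = mean_inter_arrival       # hypothetical workload parameter

    def next_arrival(self):
        # Inter-arrival time and the per-job seed are both derived from
        # the experiment seed, so two runs with the same seed deploy the
        # same workload.
        inter_arrival = self.rng.expovariate(1.0 / self.mean)
        job_seed = self.rng.randrange(2**32)
        return inter_arrival, job_seed
```

With this pattern, the orchestrator could re-run one experiment definition with a list of seeds and obtain independent but individually reproducible replications.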

Mss

Bug Report

Current Behavior
A clear and concise description of the behavior.

Input Code

  • REPL or Repo link if applicable:
var your => (code) => here;

Expected behavior/code
A clear and concise description of what you expected to happen (or code).

FLTK Configuration (execution config, system parameters, hyper-parameters, etc.)

{
  "your": { "config": "here" }
}

Environment

  • Python version: [e.g. 3.7]
  • PyTorch version: [e.g. 1.9.1]
  • OS: [e.g. OSX 10.13.4, Windows 10]
  • Kubernetes version: [e.g v1.22]
  • Platform: [e.g. minikube, GKE, AWS]

Possible Solution

Additional context/Screenshots
Add any other context about the problem here. If applicable, add screenshots to help explain.

Download of datasets fails because of wrong parameter names

Bug Report

Current Behavior
Running python3 -m fltk extractor ./configs/example_cloud_experiment.json from the project root, after having successfully installed requirements-cpu.txt, fails with the following message:

/fltk-testbed/venv/lib/python3.8/site-packages/dataclasses_json/core.py:171: RuntimeWarning: `NoneType` object value of non-optional type config_path detected when decoding DistributedConfig.
  warnings.warn(f"`NoneType` object {warning}.", RuntimeWarning)
No argument path is provided.
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/fltk-testbed/fltk/__main__.py", line 93, in <module>
    __main__()
  File "/fltk-testbed/fltk/__main__.py", line 74, in __main__
    __run_op_dict[args.action](arg_path, conf_path,
  File "/fltk-testbed/fltk/launch.py", line 122, in launch_extractor
    download_datasets(args, conf)
  File "/fltk-testbed/fltk/core/distributed/extractor.py", line 22, in download_datasets
    data_path = config.get_data_path()
AttributeError: 'NoneType' object has no attribute 'get_data_path'

Expected behavior/code
Datasets should be downloaded to the data directory.

Environment

  • Python version: 3.8
  • Platform: Windows (WSL 2)

Possible Solution
This seems to be caused by a mismatch in the keyword used for the configuration: the dispatcher passes it as config, while launch_extractor expects conf.
Provided (config):

__run_op_dict[args.action](arg_path, conf_path,
                           rank=_save_get(args, 'rank'),
                           parser=parser,
                           nic=_save_get(args, 'nic'),
                           host=_save_get(args, 'host'),
                           prefix=_save_get(args, 'prefix'),
                           args=args,
                           config=distributed_config)

Expected (conf):

fltk-testbed/fltk/launch.py, lines 111 to 122 at f8fb18d:

def launch_extractor(base_path: Path, config_path: Path, args: Namespace = None, conf: DistributedConfig = None,
                     **kwargs):
    """
    Extractor launch function, will only download all models and quit execution.
    @param args: Arguments passed from CLI.
    @type args: Namespace
    @param conf: Parsed configuration file passed from the CLI.
    @type conf: Optional[DistributedConfig]
    @return: None
    @rtype: None
    """
    download_datasets(args, conf)
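The mismatch can be reproduced in isolation with a stripped-down stand-in for launch_extractor (illustrative, not the actual function body): a value passed as config falls into **kwargs, so the conf parameter keeps its None default and the later get_data_path() call fails:

```python
def launch_extractor(base_path, config_path, args=None, conf=None, **kwargs):
    # Stand-in for the real function: just return the conf it received.
    return conf


# Bug: the dispatcher passes the keyword 'config', which is swallowed
# by **kwargs, leaving conf=None.
assert launch_extractor("arg_path", "conf_path", config="cfg") is None

# Fix: passing the same value under the expected keyword 'conf'
# actually reaches the parameter.
assert launch_extractor("arg_path", "conf_path", conf="cfg") == "cfg"
```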

Default nodepool spinning down directly

Bug Report

Current Behavior
The default node pool spins down immediately, preventing Terraform from applying the plan.

Environment

  • Python version: 3.8.3
  • PyTorch version: 1.12.1
  • OS: OSX 10.15.7
  • Kubernetes version: v1.25.0

Possible Solution
As suggested by Jeroen, either use the gcloud command to scale the node pool up, or set initial_node_count for default-node-pool in terraform-gke/main.tf to something other than 0.

Merge collaborative efforts

Feature Request

Is your feature request related to a problem? Please describe.
A clear and concise description of what the problem is. Ex. I have an issue when [...]

Describe the solution you'd like
A clear and concise description of what you want to happen. Add any considered drawbacks.

Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.

Teachability, Documentation, Adoption, Migration Strategy
If you can, explain how users will be able to use this and possibly write out a version the docs.
Maybe a screenshot or design?

Re-introduce DataParallel training

Bug Report

The current version of KFLTK only supports federated learning experiments. However, it should also support DataParallel (DistributedDataParallel) experiments. To resolve this, the orchestrator and the launch file need to be modified slightly.

Add scheduler to deployment

Feature Request

Currently, training jobs are deployed without considering resource availability at deployment time. As a result, deploying multiple jobs at once can lead to resource starvation.

Describe the solution you'd like
Adding a gang scheduler such as Volcano would resolve this issue.

Describe alternatives you've considered
NA
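As an illustration of why gang scheduling helps here, the sketch below (hypothetical code, not part of FLTK) admits a job only when its entire resource demand fits at once, instead of deploying pods piecemeal until the cluster starves:

```python
def gang_admit(jobs, capacity):
    """All-or-nothing admission: a job is deployed only if its full
    resource demand fits simultaneously, mimicking what a gang
    scheduler such as Volcano enforces for multi-pod jobs."""
    admitted = []
    free = capacity
    for name, demand in jobs:
        if demand <= free:          # the whole job fits -> admit it
            admitted.append(name)
            free -= demand
        # otherwise: hold the entire job back; never start it partially
    return admitted


# With 6 units free, job 'b' (needing 5) is held back after 'a' has
# taken 3, rather than being started with only part of its workers.
print(gang_admit([("a", 3), ("b", 5), ("c", 2)], capacity=6))
```

Real gang schedulers also queue held-back jobs and retry them when capacity frees up; the sketch only shows the admission decision.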

Deduplicate configuration files

Feature Request

Is your feature request related to a problem? Please describe.
Currently, due to the combination of the federated learning experiments created by @bacox and the initial distributed learning implementation, some duplicate/parallel code exists. Cleaning up this duplication would make the code base easier to interpret.

Describe the solution you'd like
Unify the distributed learners and federated learners, including their configuration files.

Describe alternatives you've considered

Teachability, Documentation, Adoption, Migration Strategy

Number of epochs seemingly not honoured by client training function

Bug Report

Current Behavior
Regardless of the value of the num_epochs parameter in the train method from fltk/core/client.py, it appears that only a single epoch is run.

Input Code
Given dataset cifar10 and network Cifar10CNN, the following is a condensed version of a client's logs for a training round when num_epochs = 1:

[client1] [1,     0] loss: 0.229
[client1] [1,    10] loss: 2.298
[client1] [1,    20] loss: 2.287
[client1] [1,    30] loss: 2.277
[client1] [1,    40] loss: 2.269
Train duration is 89.58029103279114 seconds
Test duration is 7.030742883682251 seconds

Setting num_epochs = 3 then gives:

[client1] [3,     0] loss: 0.230
[client1] [3,    10] loss: 2.299
[client1] [3,    20] loss: 2.288
[client1] [3,    30] loss: 2.274
[client1] [3,    40] loss: 2.258
Train duration is 95.99584817886353 seconds
Test duration is 7.070667266845703 seconds

Both runs take a similar amount of time and appear to perform roughly the same single epoch's worth of work.

Expected behavior/code
Changes to the num_epochs param should determine the number of epochs executed by the training call.
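The expected behaviour can be stated as a tiny reference loop (a sketch, not the actual fltk/core/client.py code): the body must run num_epochs times, so the total number of training steps scales with the parameter, which the near-identical timings above contradict:

```python
def train(step_fn, num_epochs, batches):
    # Reference semantics: every epoch iterates the full dataset once.
    losses = []
    for epoch in range(num_epochs):
        for batch in batches:
            losses.append(step_fn(batch))
    return losses


# Tripling num_epochs must triple the number of training steps.
one = train(lambda b: 0.0, 1, range(50))
three = train(lambda b: 0.0, 3, range(50))
assert len(three) == 3 * len(one)
```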

ResNet not working? CNN works fine

Hi,

I was running some experiments on GCloud and, for some reason, ResNet and VGG do not work. At first I thought it was a Docker/container issue on my side, but when I switched to the CNN it worked perfectly fine.

  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/federation-lab/fltk/__main__.py", line 66, in <module>
    __main__()
  File "/opt/federation-lab/fltk/__main__.py", line 34, in __main__
    client_start(arguments, config)
  File "/opt/federation-lab/fltk/__main__.py", line 57, in client_start
    launch_client(task_id, config=configuration, learning_params=learning_params, namespace=args)
  File "/opt/federation-lab/fltk/launch.py", line 54, in launch_client
    epoch_data = client.run_epochs()
  File "/opt/federation-lab/fltk/client.py", line 208, in run_epochs
    train_loss = self.train(epoch)
  File "/opt/federation-lab/fltk/client.py", line 135, in train
    outputs = self.model(inputs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/federation-lab/fltk/nets/fashion_mnist_resnet.py", line 60, in forward
    y = self.block3(y)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/container.py", line 139, in forward
    input = module(input)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/pooling.py", line 615, in forward
    return F.avg_pool2d(input, self.kernel_size, self.stride,
RuntimeError: Given input size: (512x1x1). Calculated output size: (512x0x0). Output size is too small

This is a log from FashionMNISTResNet.

I also get a "Back-off restarting failed container" error when I check the Kubernetes dashboard.

Is there something I should modify before running the ResNet code? Could you look into this?
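The error is consistent with PyTorch's output-size formula for an unpadded pooling layer, floor((in - kernel) / stride) + 1, reaching zero. The kernel/stride values below are illustrative assumptions, not read from the network definition:

```python
import math


def pool_output_size(in_size, kernel, stride):
    # Spatial output size of an unpadded pooling layer, as PyTorch
    # computes it for avg_pool2d.
    return math.floor((in_size - kernel) / stride) + 1


# A 512x1x1 feature map fed into an average pool with kernel 2, stride 2
# (plausible for a net written for larger input images) yields a 0x0
# output -- exactly the RuntimeError in the traceback above.
assert pool_output_size(1, 2, 2) == 0

# A 4x4 feature map with a matching 4x4 kernel pools cleanly to 1x1.
assert pool_output_size(4, 4, 4) == 1
```

A common fix is replacing the fixed-size pool with nn.AdaptiveAvgPool2d(1), which always produces a 1x1 output regardless of the input resolution.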

Missing step for pushing docker image

Bug Report

Current Behavior
When attempting to push the fltk docker image with a clean gcloud installation, there is an error related to authentication.

Expected behavior/code
The docker image should be pushed successfully.

Environment

  • OS: Manjaro Linux
  • Google Cloud SDK 401.0.0
  • alpha 2022.09.03
  • beta 2022.09.03
  • bq 2.0.75
  • bundled-python3-unix 3.9.12
  • core 2022.09.03
  • gcloud-crc32c 1.0.0
  • gsutil 5.12

Possible Solution
Instruct the user to run the command gcloud auth configure-docker before pushing the image.

Cannot retrieve Helm chart for NFS provisioner/provider

Bug Report

Current Behavior
Currently, the Helm provider raises an error when the Helm chart is not available locally. This was likely introduced in #52.

Expected behavior/code

No issues occur when trying to deploy the NFS provisioner.

Environment

  • Python version: -
  • PyTorch version:-
  • OS: -
  • Kubernetes version: -
  • Platform: GKE/minikube

Possible Solution

Rolling back some of the changes in #52, to use a URL instead of an archive, will likely resolve the issue.

Additional context/Screenshots
Add any other context about the problem here. If applicable, add screenshots to help explain.

Logging is not saved during federated learning experiments

Bug Report

Current Behavior
The default logging directory used by the Federator in experiments is a directory that is not mounted through a PVC during remote training. As a result, experiment data is lost after training.

Expected behavior/code
Logging data is always written to a non-conflicting directory or file during/after training when running in a cluster.

Environment

  • Python version: 3.9
  • PyTorch version: 1.9.1
  • OS: *
  • Kubernetes version: All
  • Platform: Minikube, GKE cluster

Possible Solution
Add logging directory path to the Federator.

Additional context/Screenshots

Circular import

Bug Report

It seems that there are a few circular import problems.

For example, dataset.py imports LearningParameters from fltk.util.config.arguments, but fltk.util.config.arguments in turn imports Dataset from dataset.py.

The circular import results in the error below when downloading datasets.

Traceback (most recent call last):
  File "/usr/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/cy/MyGit/fltk-testbed-group-1/fltk/__main__.py", line 9, in <module>
    from fltk.launch import launch_extractor, launch_client, launch_single, \
  File "/home/cy/MyGit/fltk-testbed-group-1/fltk/launch.py", line 15, in <module>
    from fltk.core.client import Client
  File "/home/cy/MyGit/fltk-testbed-group-1/fltk/core/client.py", line 8, in <module>
    from fltk.core.node import Node
  File "/home/cy/MyGit/fltk-testbed-group-1/fltk/core/node.py", line 7, in <module>
    from fltk.datasets.loader_util import get_dataset
  File "/home/cy/MyGit/fltk-testbed-group-1/fltk/datasets/__init__.py", line 1, in <module>
    from .cifar10 import CIFAR10Dataset
  File "/home/cy/MyGit/fltk-testbed-group-1/fltk/datasets/cifar10.py", line 5, in <module>
    from .dataset import Dataset
  File "/home/cy/MyGit/fltk-testbed-group-1/fltk/datasets/dataset.py", line 7, in <module>
    from fltk.util.config.arguments import LearningParameters
  File "/home/cy/MyGit/fltk-testbed-group-1/fltk/util/config/arguments.py", line 11, in <module>
    from fltk.datasets import CIFAR10Dataset, FashionMNISTDataset, CIFAR100Dataset, MNIST
ImportError: cannot import name 'CIFAR10Dataset' from partially initialized module 'fltk.datasets' (most likely due to a circular import) (/home/cy/MyGit/fltk-testbed-group-1/fltk/datasets/__init__.py)
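One standard way to break such a cycle, sketched below under the assumption that dataset.py needs LearningParameters only for type annotations, is typing.TYPE_CHECKING combined with string annotations:

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime, so the
    # cycle fltk.datasets -> fltk.util.config.arguments -> fltk.datasets
    # is not triggered on import.
    from fltk.util.config.arguments import LearningParameters


class Dataset:
    def __init__(self, learning_params: "LearningParameters"):
        # String annotation: LearningParameters need not exist at
        # class-definition time.
        self.learning_params = learning_params
```

If LearningParameters is needed at runtime (not just for annotations), an alternative is moving the import inside the function or method that uses it, deferring it until after both modules have finished initializing.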

Orchestrator helm chart dependency on extractor helm chart not indicated in notebook

Current Behavior
In the orchestrator Helm chart, in fl-server-pod.yaml, a PVC named fl-server-claim is established. However, the backing PV is created by the extractor chart.

The Jupyter notebooks contain no instructions on applying the extractor Helm chart. This results in an error, because the fl-server pod refers to a non-existent PV.

Expected behavior/code
The Jupyter notebook should instruct users to apply the extractor chart first, avoiding this error.

Some FL experiment (learning) parameters not propagated from config file

Bug Report

Current Behavior
The values of some learning parameters (e.g., clients per round and epochs) provided in the config of an experiment are seemingly not correctly propagated to the orchestrator (and, subsequently, the federator). It appears that the default values from fltk/util/learning_config.py's FedLearningConfig (e.g. clients_per_round: int = 2 and epochs: int = 1) are always used. The issue might also affect other parameters, although I have not experimented with all of them.

Input Code
Given configs/federated_tasks/example_arrival_config.json:

[
  {
    "type": "federated",
    "jobClassParameters": {
      "networkConfiguration": {
        "network": "FashionMNISTCNN",
        "lossFunction": "CrossEntropyLoss",
        "dataset": "mnist"
      },
      "systemParameters": {
        "dataParallelism": null,
        "configurations": {
          "Master": {
            "cores": "1000m",
            "memory": "1Gi"
          },
          "Worker": {
            "cores": "750m",
            "memory": "1Gi"
          }
        }
      },
      "hyperParameters": {
        "default": {
          "batchSize": 128,
          "testBatchSize": 128,
          "learningRateDecay": 0.0002,
          "optimizerConfig": {
            "type": "SGD",
            "learningRate": 0.01,
            "momentum": 0.1
          },
          "schedulerConfig": {
            "schedulerStepSize": 50,
            "schedulerGamma": 0.5,
            "minimumLearningRate": 1e-10
          }
        },
        "configurations": {
          "Master": null,
          "Worker": {
            "batchSize": 500,
            "optimizerConfig": {
              "learningRate": 0.05
            },
            "schedulerConfig": {
              "schedulerStepSize": 2000
            }
          }
        }
      },
      "learningParameters": {
        "totalEpochs": 5,
        "rounds": 1,
        "epochsPerRound": 3,
        "cuda": false,
        "clientsPerRound": 1,
        "dataSampler": {
          "type": "uniform",
          "qValue": 0.07,
          "seed": 42,
          "shuffle": true
        },
        "aggregation": "FedAvg"
      },
      "experimentConfiguration": {
        "randomSeed": [
          89
        ],
        "workerReplication": {
          "Master": 1,
          "Worker": 1
        }
      }
    }
  }
]

Run helm install flearner charts/orchestrator --namespace test -f charts/fltk-values-abel.yaml --set-file orchestrator.experiment=./configs/federated_tasks/example_arrival_config.json,orchestrator.configuration=./configs/example_cloud_experiment.json

Expected behavior/code
The values of the given config should be correctly reflected within the config_dict of fltk/core/distributed/orchestrator.py and self.config of fltk/core/federator.py after their initialization.
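One failure mode consistent with this behaviour (an illustration of the symptom, not a diagnosis of the actual loader) is a loader that silently drops unknown keys, so the camelCase names from the JSON never reach the snake_case dataclass fields and the defaults win:

```python
from dataclasses import dataclass, fields


@dataclass
class FedLearningConfig:
    # Defaults mirroring the values quoted from fltk/util/learning_config.py.
    clients_per_round: int = 2
    epochs: int = 1


def naive_load(cls, raw):
    # Hypothetical loader: keys that do not match a field name exactly
    # are dropped silently instead of being mapped or rejected.
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in raw.items() if k in known})


cfg = naive_load(FedLearningConfig, {"clientsPerRound": 1, "epochsPerRound": 3})
assert cfg.clients_per_round == 2  # default survives: the JSON value was dropped
assert cfg.epochs == 1
```

A loader that either maps camelCase keys explicitly or raises on unknown keys would surface this mismatch immediately instead of silently training with defaults.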

File permissions during deployment

FROM bitnami/pytorch:1.12.1

Currently there seems to be an issue with the Docker container in use, as files become inaccessible. Reverting some of the changes, or setting the user via

USER 0

in the Dockerfile, resolves this.

Alternatively, adding a securityContext configuration to the deployments also seems to alleviate the problem.
