alibaba / federatedscope

An easy-to-use federated learning platform

Home Page: https://www.federatedscope.io

License: Apache License 2.0

Languages: Python 91.57%, Shell 5.35%, Dockerfile 0.68%, Jupyter Notebook 2.40%
Topics: federated-learning, machine-learning, pytorch

federatedscope's Issues

Fails to run the standalone example with the minimal version of requirements

An error occurs in FederatedScope/federatedscope/cv/dataset/leaf_cv.py:

When the dataset needs to be downloaded, the function download_url is imported from the module torch_geometric.data, which is not included in the minimal requirements (and, in my opinion, it shouldn't be).

So maybe we should replace download_url with another implementation here. Thanks :)
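
A possible direction, sketched below under the assumption that leaf_cv.py only needs download_url(url, folder) to fetch a file into a folder and return its local path: a standard-library replacement using urllib, so the minimal requirements stay free of torch_geometric. The helper name mirrors the imported one for illustration only.

import os
import urllib.request

def download_url(url: str, folder: str) -> str:
    # Download `url` into `folder` (created if missing) and return the local path.
    os.makedirs(folder, exist_ok=True)
    filename = url.rstrip('/').split('/')[-1].split('?')[0]
    path = os.path.join(folder, filename)
    if not os.path.exists(path):  # skip re-downloading an existing file
        urllib.request.urlretrieve(url, path)
    return path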

Cannot import federatedscope

Although I have successfully completed the installation procedure, I cannot import federatedscope from a path other than the root of the cloned repo.

Unexpected behavior when checking the sample_client_num

sample_client_num_valid = (0 < cfg.federate.sample_client_num <=
                           cfg.federate.client_num)

The term cfg.federate.client_num is allowed to be 0, which means that client_num will be determined by the dataset after the loading process. However, the above check happens before data loading and thus causes some unexpected behaviors, such as forcibly setting sample_client_num to the same value as cfg.federate.client_num:

if non_sample_case or not sample_cfg_valid:
# (a) use all clients
cfg.federate.sample_client_num = cfg.federate.client_num
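
A sketch of one possible fix, assuming the check can simply be postponed while client_num is still 0 (i.e., not yet resolved from the dataset); the config field names follow the snippets above, the function itself is hypothetical:

def validate_sample_client_num(cfg):
    if cfg.federate.client_num == 0:
        # client_num will only be known after data loading; defer the check.
        return
    assert 0 < cfg.federate.sample_client_num <= cfg.federate.client_num, \
        "sample_client_num must be in (0, client_num]"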

Is there a complete tutorial on how to plug in my own data/model/trainer and run an FL task?

Hi, I am trying to follow the guidance here: https://federatedscope.io/docs/own-case/ to add my own data/model/trainer etc. into the project and run an FL task.
However, this guidance is not that clear, particularly: 1) how the config section works; 2) how to put the customized data/model/trainer/config together to complete an FL task.
Would it be possible to provide complete guidance using something like MNIST/CIFAR-10?

A keyword to indicate the task type

Is it possible to add a keyword, such as cfg.federate.task_type, to indicate the task type of each client? It would be useful when calculating the loss function because y_true should be long for classification tasks and float for regression tasks.
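
A sketch of how such a keyword could be consumed when computing the loss; cfg.federate.task_type is the proposed (not yet existing) field, while the casting itself is plain PyTorch:

import torch.nn.functional as F

def compute_loss(cfg, y_pred, y_true):
    if cfg.federate.task_type == 'classification':
        return F.cross_entropy(y_pred, y_true.long())            # labels as int64
    elif cfg.federate.task_type == 'regression':
        return F.mse_loss(y_pred.squeeze(-1), y_true.float())    # targets as float
    raise ValueError(f"Unknown task type: {cfg.federate.task_type}")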

Confusing naming of reported results

'Results_raw': {'client_individual': {'val_loss': 0.7106942534446716, 'test_loss': 0.7106942534446716, 'test_avg_loss': 0.7106942534446716, 'test_total': 1.0, 'val_avg_loss': 0.7106942534446716, 'val_total': 1.0}, 'client_summarized_weighted_avg': {'val_loss'

It is difficult for users to know that client_individual means the best individual results.

A gRPC-related suggestion

In the current main branch (commit 954322c), the implementation of the gRPCCommManager class (see federatedscope/core/communication.py) largely follows that in FedML (see https://github.com/FedML-AI/FedML/blob/master/fedml_core/distributed/communication/gRPC/grpc_comm_manager.py at commit 0fb63dd157e55ee603b7049568bf4c4ed0586e71), as noted in FederatedScope's codebase. This class is based on gRPC, a modern open-source, high-performance Remote Procedure Call (RPC) framework. A gRPCCommManager (i) keeps the addresses of potential message receivers in a dict/list collection and (ii) provides wrapper functions that call gRPC APIs for sending and receiving messages. Many similar variants of such wrapper functions have been widely adopted in related packages. Although FederatedScope complies with the Apache-2.0 License and includes a declaration of FedML's copyright, in order to avoid the risk of unintended infringement and unnecessary disputes with FedML, we have re-implemented this class by referring to the examples in the gRPC tutorial in this commit.

In Quick Start, the "build docker image and run with docker env" command has an error

In Quick Start, the "build docker image and run with docker env" command has an error.
In the docs, the command is:
docker run --gpus device=all --rm --it --name "fedscope" -w $(pwd) alibaba/federatedscope:base-env-torch1.10 /bin/bash
It should be:
docker run --gpus device=all --rm -it --name "fedscope" -w $(pwd) alibaba/federatedscope:base-env-torch1.10 /bin/bash
or:
docker run --gpus device=all --rm -i -t --name "fedscope" -w $(pwd) alibaba/federatedscope:base-env-torch1.10 /bin/bash

CUDA error reported when trying to launch the demo case

Hi, when I try to launch the demo case, a CUDA-related error is reported as below:

I am using conda to manage the environment; in another env, PyTorch works on CUDA without any problem.
I think this could be an installation issue, as I did not install anything by myself and followed your guidance exactly.
My CUDA version:
NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6
and my torch version: 1.10.1

(fedscope) liangma@lMa-X1:~/prj/FederatedScope$ python federatedscope/main.py --cfg federatedscope/example_configs/femnist.yaml

...
2022-05-13 22:06:09,249 (server:520) INFO: ----------- Starting training (Round #0) -------------
Traceback (most recent call last):
 File "/home/liangma/prj/FederatedScope/federatedscope/main.py", line 41, in <module>
   _ = runner.run()
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/federatedscope-0.1.0-py3.9.egg/federatedscope/core/fed_runner.py", line 136, in run
   self._handle_msg(msg)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/federatedscope-0.1.0-py3.9.egg/federatedscope/core/fed_runner.py", line 254, in _handle_msg
   self.client[each_receiver].msg_handlers[msg.msg_type](msg)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/federatedscope-0.1.0-py3.9.egg/federatedscope/core/worker/client.py", line 202, in callback_funcs_for_model_para
   sample_size, model_para_all, results = self.trainer.train()
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/federatedscope-0.1.0-py3.9.egg/federatedscope/core/trainers/trainer.py", line 374, in train
   self._run_routine("train", hooks_set, target_data_split_name)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/federatedscope-0.1.0-py3.9.egg/federatedscope/core/trainers/trainer.py", line 208, in _run_routine
   hook(self.ctx)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/federatedscope-0.1.0-py3.9.egg/federatedscope/core/trainers/trainer.py", line 474, in _hook_on_fit_start_init
   ctx.model.to(ctx.device)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/torch/nn/modules/module.py", line 899, in to
   return self._apply(convert)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/torch/nn/modules/module.py", line 570, in _apply
   module._apply(fn)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/torch/nn/modules/module.py", line 593, in _apply
   param_applied = fn(param)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/torch/nn/modules/module.py", line 897, in convert
   return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/torch/cuda/__init__.py", line 208, in _lazy_init
   raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
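
For reference, a small generic check (not part of FederatedScope) that tells whether the installed torch wheel was built with CUDA at all; if torch.version.cuda prints None, the CPU-only build was installed and either a CUDA-enabled torch build or running on CPU is needed:

import torch

print("torch version:   ", torch.__version__)
print("built with CUDA: ", torch.version.cuda)        # None means a CPU-only build
print("CUDA available:  ", torch.cuda.is_available())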

The optimizer may track some values across different training routines

  • Since the optimizer is initialized within the context, different training routines will use the same optimizer.
  • Some optimizers, like Adam, track past momentum.

Therefore, the optimizer may carry state variables across different training routines. Considering that the model at the start of each training routine is broadcast by the server, it may be unnecessary or even wrong to keep such past state.
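
A sketch of one way to avoid this, assuming the optimizer can simply be rebuilt at the start of every routine for the freshly broadcast model; the hook name and ctx attributes mimic the trainer code but are illustrative only:

import torch

def _hook_on_fit_start_reset_optimizer(ctx):
    # Rebuild the optimizer for the model just broadcast by the server,
    # so no momentum/state from earlier routines is carried over.
    ctx.optimizer = torch.optim.Adam(ctx.model.parameters(), lr=ctx.lr)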

An error happens when batch_size is larger than the number of a client's local samples

When drop_last=True and batch_size is larger than the number of a client's local samples, no batch is produced for local training and the following error occurs:

File "/root/miniconda3/lib/python3.9/site-packages/federatedscope-0.1.0-py3.9.egg/federatedscope/core/trainers/context.py", line 154, in pre_calculate_batch_epoch_num
    num_train_epoch = math.ceil(local_update_steps / num_train_batch)
ZeroDivisionError: division by zero
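
A sketch of a guard that could sit where the epoch count is derived (mirroring pre_calculate_batch_epoch_num from the traceback, but only as an illustration of the idea):

import math

def num_train_epoch_for(local_update_steps, num_samples, batch_size, drop_last=True):
    num_train_batch = (num_samples // batch_size if drop_last
                       else math.ceil(num_samples / batch_size))
    if num_train_batch == 0:
        # With drop_last=True and batch_size > num_samples, no batch is formed;
        # fail with a readable message instead of a ZeroDivisionError.
        raise ValueError("batch_size exceeds the client's local sample size; "
                         "reduce batch_size or set drop_last=False")
    return math.ceil(local_update_steps / num_train_batch)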

Terse logging

The results (mostly floating-point numbers) are printed to stdout without controlling the precision, so the reported results look lengthy.
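
A generic sketch (not FederatedScope's own formatting code) of rounding nested result dicts before they are logged:

def round_results(results, ndigits=4):
    if isinstance(results, float):
        return round(results, ndigits)
    if isinstance(results, dict):
        return {k: round_results(v, ndigits) for k, v in results.items()}
    if isinstance(results, (list, tuple)):
        return type(results)(round_results(v, ndigits) for v in results)
    return results

# e.g. round_results({'val_loss': 0.7106942534446716}) -> {'val_loss': 0.7107}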

Redundancy in the log files

A FedAvg trial on 5% of FEMNIST produces about 500 KB of log per round:
roughly 80% are per-client eval lines like 2022-04-13 16:33:24,901 (client:264) INFO: Client #1: (Evaluation (test set) at Round #26) test_loss is 79.352451., about 10% are server results, and about 10% is training information.

If the number of rounds is 500, 1000, or much larger, the log files will take up too much space with a lot of redundancy. @yxdyc
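
One possible mitigation, sketched with only the standard library and the log format quoted above (not an existing FederatedScope feature): a logging filter that keeps per-round eval lines only every k rounds.

import logging
import re

class EvalLogThinner(logging.Filter):
    _round_pat = re.compile(r"at Round #(\d+)")

    def __init__(self, every_k_rounds=10):
        super().__init__()
        self.k = every_k_rounds

    def filter(self, record):
        match = self._round_pat.search(record.getMessage())
        if match is None:
            return True                               # keep lines without a round number
        return int(match.group(1)) % self.k == 0      # keep 1 in k rounds

# attach to a handler, e.g. some_handler.addFilter(EvalLogThinner(every_k_rounds=10))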

Mechanisms to support asynchronous training protocol in FL

As the title says.
Some ideas can be borrowed from asynchronous SGD, including but not limited to:

  • Timeout strategy
  • Oversampling
  • Staleness toleration
  • Variance toleration

A simulator can also be provided for running asynchronous FL standalone.
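
A sketch of staleness-aware aggregation on the server side, illustrating two of the ideas above (timeout and staleness toleration); all names are hypothetical and this is not FederatedScope's implementation:

def aggregate_async(updates, current_round, max_staleness=5):
    # updates: list of (round_sent, sample_weight, model_delta) tuples,
    # where model_delta supports scalar multiplication and addition.
    agg, total = None, 0.0
    for round_sent, weight, delta in updates:
        staleness = current_round - round_sent
        if staleness > max_staleness:        # timeout strategy: drop overly stale updates
            continue
        w = weight / (1.0 + staleness)       # staleness toleration: down-weight old updates
        agg = delta * w if agg is None else agg + delta * w
        total += w
    return None if total == 0 else agg / total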

Improve logging content with more metrics and better readability

We aim to improve logging content with more metrics and better readability:

  • The system-level metrics are missing. We need to add more metrics reflecting system performance, such as communication and computation efficiency.

  • We may add a metric to reflect convergence, e.g., the number of rounds to converge.

We can discuss specific metric requirements and logging timing here @rayrayraykk

Guidance for running the examples in FederatedScope/scripts/

Although FederatedScope provides different kinds of scripts in FederatedScope/scripts/, it is a little hard for users to understand these scripts without some guidance, such as which example a certain script runs and how to use these scripts for customized tasks.

Maybe some detailed guidance on the scripts should be provided for users. Thanks :)

Monitoring improvement: visualization support

Monitoring can be improved in terms of visualization support.

To use visualization tools such as wandb or tensorboard, we currently parse the log file after the results are saved. We need to support logging results in real time rather than in this two-step style. Besides, the parsing process should be automated for better usability.

We can discuss other requirements for visualization here @rayrayraykk
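
A sketch of the real-time alternative using wandb's standard API; the integration point inside FederatedScope and the flattened dict layout (following the 'Results_raw' example quoted earlier) are assumptions:

import wandb

wandb.init(project="federatedscope-demo")   # hypothetical project name

def log_eval_results(round_idx, results):
    # Flatten e.g. {'client_summarized_weighted_avg': {'val_loss': ...}, ...}
    flat = {f"{group}/{name}": value
            for group, metrics in results.items()
            for name, value in metrics.items()}
    wandb.log(flat, step=round_idx)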

Keep the label distributions of train and test set consistent within one client

When using the splitter for customized datasets, the label distributions of the train and test sets are independent within one client.
This can make observations of client-wise performance meaningless because of the mismatch between the train and test distributions.

IMO, FederatedScope could provide an option for users to keep the label distributions of the train and test sets consistent within one client when using the splitter, which can be useful in tasks such as personalized federated learning.
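
A sketch of what the proposed option could do per client, using a standard stratified split from scikit-learn (the splitter hook itself is hypothetical):

from sklearn.model_selection import train_test_split

def split_client_data(features, labels, test_ratio=0.2, seed=0):
    # Stratify on the labels so train and test share the client's label distribution.
    return train_test_split(features, labels,
                            test_size=test_ratio,
                            stratify=labels,
                            random_state=seed)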

Some questions about cross-device FL.

Hello guys!
I have read the tutorial about FederatedScope. It seems that the whole project is based on Python and that the cross-device part is just a simulation.
I wonder whether there is any cross-language design to handle the communication between the client and the server, for example with Android (Java) on the mobile phone and Linux (Java/Python) on the server, because some devices lack a Python environment.
What's more, has there been any trial on real devices, especially for the cross-device part?
I would appreciate it if you could resolve my doubts.

Thank you for your work on FederatedScope!

Incorrect Evaluation

In each round, multiple evaluation results are reported in the logs, each of which seems to cover only a fraction of the clients.

Improve the current finetune mechanism

The current finetune is implemented by reusing the training routine, so we have to store the context variables that belong to the training process and restore them after finetuning. Besides, the current finetune doesn't support training by epoch. So maybe we can separate out a dedicated routine and a dedicated value of "ctx.cur_mode" for finetuning, for example ctx.cur_mode == "finetune".
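
A sketch of the proposed separation, with illustrative method and attribute names only: a dedicated routine that sets ctx.cur_mode to "finetune" for its duration, so no training-time context needs to be saved and restored, and that loops by epoch:

class TrainerWithFinetune:
    def finetune(self, target_data_split_name="train", num_epoch=1):
        prev_mode = self.ctx.cur_mode
        self.ctx.cur_mode = "finetune"            # dedicated mode instead of reusing "train"
        try:
            for _ in range(num_epoch):            # finetuning by epoch becomes possible
                self._run_epoch(target_data_split_name)   # hypothetical per-epoch helper
        finally:
            self.ctx.cur_mode = prev_mode         # training context is left untouched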

Invalid system metric names

There are metrics with invalid names. This seems to be caused by incorrect recursive concatenation of strings.
