alibaba / federatedscope

An easy-to-use federated learning platform

Home Page: https://www.federatedscope.io

License: Apache License 2.0

Languages: Python 91.57%, Shell 5.35%, Dockerfile 0.68%, Jupyter Notebook 2.40%
Topics: federated-learning, machine-learning, pytorch

federatedscope's Issues

Fails to run the standalone example with the minimal version of requirements

An error occurs in FederatedScope/federatedscope/cv/dataset/leaf_cv.py:

When the dataset needs to be downloaded, the function download_url is imported from the module torch_geometric.data, which is not included in the minimal requirements (and, in my opinion, it shouldn't be).

So maybe we should replace download_url with another implementation here. Thanks :)
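
A possible direction, sketched below under the assumption that leaf_cv.py only needs download_url(url, folder) to fetch a file into a folder and return its local path: a standard-library replacement using urllib, so the minimal requirements stay free of torch_geometric. The helper name mirrors the imported one for illustration only.

import os
import urllib.request

def download_url(url: str, folder: str) -> str:
    # Download `url` into `folder` (created if missing) and return the local path.
    os.makedirs(folder, exist_ok=True)
    filename = url.rstrip('/').split('/')[-1].split('?')[0]
    path = os.path.join(folder, filename)
    if not os.path.exists(path):  # skip re-downloading an existing file
        urllib.request.urlretrieve(url, path)
    return path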

Cannot import federatedscope

Although I have successfully completed the installation procedure, I cannot import federatedscope from a path other than the root of the cloned repo.

Unexpected behavior when checking the sample_client_num

sample_client_num_valid = (0 < cfg.federate.sample_client_num <=
                           cfg.federate.client_num)

The term cfg.federate.client_num is allowed to be 0, which means that client_num will be determined by the dataset after the loading process. However, the above check happens before data loading and thus causes some unexpected behaviors, such as forcibly setting sample_client_num to the same value as cfg.federate.client_num:

if non_sample_case or not sample_cfg_valid:
# (a) use all clients
cfg.federate.sample_client_num = cfg.federate.client_num
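
A sketch of one possible fix, assuming the check can simply be postponed while client_num is still 0 (i.e., not yet resolved from the dataset); the config field names follow the snippets above, the function itself is hypothetical:

def validate_sample_client_num(cfg):
    if cfg.federate.client_num == 0:
        # client_num will only be known after data loading; defer the check.
        return
    assert 0 < cfg.federate.sample_client_num <= cfg.federate.client_num, \
        "sample_client_num must be in (0, client_num]"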

Is there a complete tutorial on how to plug in my own data/model/trainer and run an FL task?

Hi, I am trying to follow the guidance here: https://federatedscope.io/docs/own-case/ to add my own data/model/trainer etc. into the project and run an FL task.
However, this guidance is not that clear, particularly: 1) how the config section works; 2) how to put the customized data/model/trainer/config together to complete an FL task.
Would it be possible to provide complete guidance using something like MNIST/CIFAR-10?

A keyword to indicate the task type

Is it possible to add a keyword, such as cfg.federate.task_type, to indicate the task type of each client? It would be useful when calculating the loss function because y_true should be long for classification tasks and float for regression tasks.
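
A sketch of how such a keyword could be consumed when computing the loss; cfg.federate.task_type is the proposed (not yet existing) field, while the casting itself is plain PyTorch:

import torch.nn.functional as F

def compute_loss(cfg, y_pred, y_true):
    if cfg.federate.task_type == 'classification':
        return F.cross_entropy(y_pred, y_true.long())            # labels as int64
    elif cfg.federate.task_type == 'regression':
        return F.mse_loss(y_pred.squeeze(-1), y_true.float())    # targets as float
    raise ValueError(f"Unknown task type: {cfg.federate.task_type}")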

Confusing naming of reported results

'Results_raw': {'client_individual': {'val_loss': 0.7106942534446716, 'test_loss': 0.7106942534446716, 'test_avg_loss': 0.7106942534446716, 'test_total': 1.0, 'val_avg_loss': 0.7106942534446716, 'val_total': 1.0}, 'client_summarized_weighted_avg': {'val_loss'

It is difficult for users to know that client_individual means the best individual results.

A gRPC-related suggestion

In the current main branch (commit 954322c), the implementation of the gRPCCommManager class (see federatedscope/core/communication.py) largely follows that in FedML (see https://github.com/FedML-AI/FedML/blob/master/fedml_core/distributed/communication/gRPC/grpc_comm_manager.py at commit 0fb63dd157e55ee603b7049568bf4c4ed0586e71), as noted in FederatedScope's codebase. This class is based on gRPC, a modern open-source, high-performance Remote Procedure Call (RPC) framework. A gRPCCommManager (i) keeps the addresses of potential message receivers in a dict/list collection and (ii) provides wrapper functions that call gRPC APIs for sending and receiving messages. Many similar variants of such wrapper functions have been widely adopted in related packages. Although FederatedScope complies with the Apache-2.0 License and includes a declaration of FedML's copyright, in order to avoid the risk of unintended infringement and unnecessary disputes with FedML, we have re-implemented this class by referring to the examples in the gRPC tutorial in this commit.

In Quick Start, the "build docker image and run with docker env" command has an error

In Quick Start, the "build docker image and run with docker env" command has an error.
In the docs, the command is:
docker run --gpus device=all --rm --it --name "fedscope" -w $(pwd) alibaba/federatedscope:base-env-torch1.10 /bin/bash
It should be:
docker run --gpus device=all --rm -it --name "fedscope" -w $(pwd) alibaba/federatedscope:base-env-torch1.10 /bin/bash
or:
docker run --gpus device=all --rm -i -t --name "fedscope" -w $(pwd) alibaba/federatedscope:base-env-torch1.10 /bin/bash

CUDA error reported when trying to launch the demo case

Hi, when I try to launch the demo case, a CUDA-related error is reported as below:

I am using conda to manage the environment; in another env, PyTorch works on CUDA without any problem.
I think this could be an installation issue, as I did not install anything by myself and followed your guidance exactly.
My CUDA version:
NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6
and my torch version: 1.10.1

(fedscope) liangma@lMa-X1:~/prj/FederatedScope$ python federatedscope/main.py --cfg federatedscope/example_configs/femnist.yaml

...
2022-05-13 22:06:09,249 (server:520) INFO: ----------- Starting training (Round #0) -------------
Traceback (most recent call last):
 File "/home/liangma/prj/FederatedScope/federatedscope/main.py", line 41, in <module>
   _ = runner.run()
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/federatedscope-0.1.0-py3.9.egg/federatedscope/core/fed_runner.py", line 136, in run
   self._handle_msg(msg)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/federatedscope-0.1.0-py3.9.egg/federatedscope/core/fed_runner.py", line 254, in _handle_msg
   self.client[each_receiver].msg_handlers[msg.msg_type](msg)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/federatedscope-0.1.0-py3.9.egg/federatedscope/core/worker/client.py", line 202, in callback_funcs_for_model_para
   sample_size, model_para_all, results = self.trainer.train()
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/federatedscope-0.1.0-py3.9.egg/federatedscope/core/trainers/trainer.py", line 374, in train
   self._run_routine("train", hooks_set, target_data_split_name)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/federatedscope-0.1.0-py3.9.egg/federatedscope/core/trainers/trainer.py", line 208, in _run_routine
   hook(self.ctx)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/federatedscope-0.1.0-py3.9.egg/federatedscope/core/trainers/trainer.py", line 474, in _hook_on_fit_start_init
   ctx.model.to(ctx.device)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/torch/nn/modules/module.py", line 899, in to
   return self._apply(convert)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/torch/nn/modules/module.py", line 570, in _apply
   module._apply(fn)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/torch/nn/modules/module.py", line 593, in _apply
   param_applied = fn(param)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/torch/nn/modules/module.py", line 897, in convert
   return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
 File "/home/liangma/miniconda3/envs/fedscope/lib/python3.9/site-packages/torch/cuda/__init__.py", line 208, in _lazy_init
   raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
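
For reference, a small generic check (not part of FederatedScope) that tells whether the installed torch wheel was built with CUDA at all; if torch.version.cuda prints None, the CPU-only build was installed and either a CUDA-enabled torch build or running on CPU is needed:

import torch

print("torch version:   ", torch.__version__)
print("built with CUDA: ", torch.version.cuda)        # None means a CPU-only build
print("CUDA available:  ", torch.cuda.is_available())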

The optimizer may track some values across different training routines

  • Since the optimizer is initialized within the context, different training routines will use the same optimizer.
  • Some optimizers, like Adam, track past momentum.

Therefore, the optimizer may carry state variables across different training routines. Considering that the model at the start of each training routine is broadcast by the server, it may be unnecessary or even wrong to keep such past state.
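
A sketch of one way to avoid this, assuming the optimizer can simply be rebuilt at the start of every routine for the freshly broadcast model; the hook name and ctx attributes mimic the trainer code but are illustrative only:

import torch

def _hook_on_fit_start_reset_optimizer(ctx):
    # Rebuild the optimizer for the model just broadcast by the server,
    # so no momentum/state from earlier routines is carried over.
    ctx.optimizer = torch.optim.Adam(ctx.model.parameters(), lr=ctx.lr)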

An error happens when batch_size is larger than the number of a client's local samples

When drop_last=True and batch_size is larger than the number of a client's local samples, no batch is produced for local training and the following error occurs:

File "/root/miniconda3/lib/python3.9/site-packages/federatedscope-0.1.0-py3.9.egg/federatedscope/core/trainers/context.py", line 154, in pre_calculate_batch_epoch_num
    num_train_epoch = math.ceil(local_update_steps / num_train_batch)
ZeroDivisionError: division by zero
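
A sketch of a guard that could sit where the epoch count is derived (mirroring pre_calculate_batch_epoch_num from the traceback, but only as an illustration of the idea):

import math

def num_train_epoch_for(local_update_steps, num_samples, batch_size, drop_last=True):
    num_train_batch = (num_samples // batch_size if drop_last
                       else math.ceil(num_samples / batch_size))
    if num_train_batch == 0:
        # With drop_last=True and batch_size > num_samples, no batch is formed;
        # fail with a readable message instead of a ZeroDivisionError.
        raise ValueError("batch_size exceeds the client's local sample size; "
                         "reduce batch_size or set drop_last=False")
    return math.ceil(local_update_steps / num_train_batch)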

Terse logging

The results (mostly floating-point numbers) are printed to stdout without controlling the precision, so the reported results look lengthy.
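
A generic sketch (not FederatedScope's own formatting code) of rounding nested result dicts before they are logged:

def round_results(results, ndigits=4):
    if isinstance(results, float):
        return round(results, ndigits)
    if isinstance(results, dict):
        return {k: round_results(v, ndigits) for k, v in results.items()}
    if isinstance(results, (list, tuple)):
        return type(results)(round_results(v, ndigits) for v in results)
    return results

# e.g. round_results({'val_loss': 0.7106942534446716}) -> {'val_loss': 0.7107}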

Redundancy in the log files

A FedAvg trial on 5% of FEMNIST produces about 500 KB of log per round:
roughly 80% are per-client eval lines like 2022-04-13 16:33:24,901 (client:264) INFO: Client #1: (Evaluation (test set) at Round #26) test_loss is 79.352451., about 10% are server results, and about 10% is training information.

If the number of rounds is 500, 1000, or much larger, the log files will take up too much space with a lot of redundancy. @yxdyc
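
One possible mitigation, sketched with only the standard library and the log format quoted above (not an existing FederatedScope feature): a logging filter that keeps per-round eval lines only every k rounds.

import logging
import re

class EvalLogThinner(logging.Filter):
    _round_pat = re.compile(r"at Round #(\d+)")

    def __init__(self, every_k_rounds=10):
        super().__init__()
        self.k = every_k_rounds

    def filter(self, record):
        match = self._round_pat.search(record.getMessage())
        if match is None:
            return True                               # keep lines without a round number
        return int(match.group(1)) % self.k == 0      # keep 1 in k rounds

# attach to a handler, e.g. some_handler.addFilter(EvalLogThinner(every_k_rounds=10))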

Mechanisms to support asynchronous training protocol in FL

As the title says.
Some ideas can be borrowed from asynchronous SGD, including but not limited to:

  • Timeout strategy
  • Oversampling
  • Staleness toleration
  • Variance toleration

A simulator can also be provided for running asynchronous FL standalone.
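
A sketch of staleness-aware aggregation on the server side, illustrating two of the ideas above (timeout and staleness toleration); all names are hypothetical and this is not FederatedScope's implementation:

def aggregate_async(updates, current_round, max_staleness=5):
    # updates: list of (round_sent, sample_weight, model_delta) tuples,
    # where model_delta supports scalar multiplication and addition.
    agg, total = None, 0.0
    for round_sent, weight, delta in updates:
        staleness = current_round - round_sent
        if staleness > max_staleness:        # timeout strategy: drop overly stale updates
            continue
        w = weight / (1.0 + staleness)       # staleness toleration: down-weight old updates
        agg = delta * w if agg is None else agg + delta * w
        total += w
    return None if total == 0 else agg / total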

Improve logging content with more metrics and better readability

We aim to improve logging content with more metrics and better readability:

  • The system-level metrics are missing. We need to add more metrics reflecting system performance, such as communication and computation efficiency.

  • We may add a metric to reflect convergence, e.g., the number of rounds to converge.

We can discuss specific metric requirements and logging timing here @rayrayraykk

Guidance for running the examples in FederatedScope/scripts/

Although FederatedScope provides different kinds of scripts in FederatedScope/scripts/, it is a little hard for users to understand these scripts without some guidance, such as which example a certain script runs and how to use these scripts for customized tasks.

Maybe some detailed guidance on the scripts should be provided for users. Thanks :)

Monitoring improvement: visualization support

Monitoring can be improved in terms of visualization support.

To use visualization tools such as wandb or tensorboard, we currently parse the log file after the results are saved. We need to support logging results in real time rather than in this two-step style. Besides, the parsing process should be automated for better usability.

We can discuss other requirements for visualization here @rayrayraykk
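
A sketch of the real-time alternative using wandb's standard API; the integration point inside FederatedScope and the flattened dict layout (following the 'Results_raw' example quoted earlier) are assumptions:

import wandb

wandb.init(project="federatedscope-demo")   # hypothetical project name

def log_eval_results(round_idx, results):
    # Flatten e.g. {'client_summarized_weighted_avg': {'val_loss': ...}, ...}
    flat = {f"{group}/{name}": value
            for group, metrics in results.items()
            for name, value in metrics.items()}
    wandb.log(flat, step=round_idx)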

Keep the label distributions of train and test set consistent within one client

When using the splitter for customized datasets, the label distributions of the train and test sets are independent within one client.
This can make observations of client-wise performance meaningless because of the mismatch between the train and test distributions.

IMO, FederatedScope could provide an option for users to keep the label distributions of the train and test sets consistent within one client when using the splitter, which can be useful in tasks such as personalized federated learning.
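
A sketch of what the proposed option could do per client, using a standard stratified split from scikit-learn (the splitter hook itself is hypothetical):

from sklearn.model_selection import train_test_split

def split_client_data(features, labels, test_ratio=0.2, seed=0):
    # Stratify on the labels so train and test share the client's label distribution.
    return train_test_split(features, labels,
                            test_size=test_ratio,
                            stratify=labels,
                            random_state=seed)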

Some questions about cross-device FL.

Hello guys!
I have read the tutorial about FederatedScope. It seems that the whole project is based on Python and that the cross-device part is just a simulation.
I wonder whether there is any cross-language design to handle the communication between the client and the server, for example with Android (Java) on the mobile phone and Linux (Java/Python) on the server, because some devices lack a Python environment.
What's more, has there been any trial on real devices, especially for the cross-device part?
I would appreciate it if you could resolve my doubts.

Thank you for your work on FederatedScope!

Incorrect Evaluation

In each round, multiple evaluation results are reported in the logs, each of which seems to cover only a fraction of the clients.

Improve the current finetune mechanism

The current finetune is implemented by reusing the training routine, so we have to store the context variables that belong to the training process and restore them after finetuning. Besides, the current finetune doesn't support training by epoch. So maybe we can separate out a dedicated routine and a dedicated value of "ctx.cur_mode" for finetuning, for example ctx.cur_mode == "finetune".
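
A sketch of the proposed separation, with illustrative method and attribute names only: a dedicated routine that sets ctx.cur_mode to "finetune" for its duration, so no training-time context needs to be saved and restored, and that loops by epoch:

class TrainerWithFinetune:
    def finetune(self, target_data_split_name="train", num_epoch=1):
        prev_mode = self.ctx.cur_mode
        self.ctx.cur_mode = "finetune"            # dedicated mode instead of reusing "train"
        try:
            for _ in range(num_epoch):            # finetuning by epoch becomes possible
                self._run_epoch(target_data_split_name)   # hypothetical per-epoch helper
        finally:
            self.ctx.cur_mode = prev_mode         # training context is left untouched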

Invalid system metric names

There are metrics with invalid names. This seems to be caused by incorrect recursive concatenation of strings.
