karhoutam / fl-bench

Benchmark of federated learning. Dedicated to the community. 🤗

License: MIT License

Shell 3.04% Python 96.40% Dockerfile 0.07% Jupyter Notebook 0.49%
federated-learning personalized-federated-learning pytorch-implementation federated-learning-framework deep-learning machine-learning

fl-bench's Issues

Comparison of the results of FedAvg on Cifar with the original paper

I used the FedAvg CNN defined in your repository to train on the CIFAR-10 dataset. The final accuracy only converges to about 0.8, which is quite different from what was published in the original paper. Have you tried this experiment?

The result reported in the paper was attached as a screenshot. It takes me more than 500 rounds to reach 0.8, and the accuracy does not improve after that.

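[Question] Problem with the results of perfedavg

Hello, thank you for your open-source spirit. I think this project is very helpful, so I also sponsored you via WeChat as a token of appreciation. I have also run into something confusing: when I run perfedavg, the curve does not increase gradually. (The plots for mnist were attached as images.)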

[Implementation Error] The code of algorithm "ccvr" is missing a "()"

p1

https://github.com/KarhouTam/FL-bench/blob/54d65a7d91bcf255a16381e103102142d34a72d8/src/server/ccvr.py#L71C1-L73C14
According to the original formula (attached as an image in the issue), the last term 'labels_count[c] - 1' should be wrapped in parentheses:

          classes_cov[c] -= labels_count[c] / (labels_count[c] - 1) * (
              classes_mean[c].unsqueeze(1) @ classes_mean[c].unsqueeze(0)
          )

Please correct this.
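To illustrate why the parentheses matter, here is a minimal scalar sketch. It assumes the formula in the attached image is the usual pooled-covariance identity (per-client counts, means and covariances combined into global statistics); the function name and toy data are hypothetical and not part of FL-bench. In one dimension the covariance becomes a variance, so the result can be checked against the variance of the pooled samples.

```python
import statistics


def pooled_mean_and_variance(client_samples):
    """Combine per-client counts, means and variances into global statistics."""
    counts = [len(s) for s in client_samples]
    means = [statistics.mean(s) for s in client_samples]
    # Unbiased per-client variance; every client needs at least 2 samples.
    variances = [statistics.variance(s) for s in client_samples]

    total = sum(counts)
    global_mean = sum(n * m for n, m in zip(counts, means)) / total

    global_var = sum((n - 1) / (total - 1) * v for n, v in zip(counts, variances))
    global_var += sum(n / (total - 1) * m * m for n, m in zip(counts, means))
    # The parentheses are essential: total / (total - 1), not total / total - 1.
    global_var -= total / (total - 1) * global_mean * global_mean
    return global_mean, global_var


# Hypothetical toy data: three clients holding samples of one class.
clients = [[1.0, 2.0, 4.0], [3.0, 5.0], [0.5, 2.5, 6.0, 7.0]]
mean, var = pooled_mean_and_variance(clients)
all_samples = [x for s in clients for x in s]

# Operator precedence: without parentheses the factor collapses to 0.
without_parentheses = 5 / 5 - 1
with_parentheses = 5 / (5 - 1)
```

Without the parentheses, `n / n - 1` evaluates to `0`, so the mean term would never be subtracted.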

p2

Second, in the same file, there is a problem when no data exists for a specific target class. The current code is:

    def generate_virtual_representation(
        self, classes_mean: List[torch.Tensor], classes_cov: List[torch.Tensor]
    ):
        data, targets = [], []
        for c, (mean, cov) in enumerate(zip(classes_mean, classes_cov)):

I recommend adding the following check so that such classes are skipped and the function does not fail.

    def generate_virtual_representation(
        self, classes_mean: List[torch.Tensor], classes_cov: List[torch.Tensor]
    ):
        data, targets = [], []
        for c, (mean, cov) in enumerate(zip(classes_mean, classes_cov)):
            if torch.all(torch.isnan(mean)) or torch.all(torch.isnan(cov)):
                continue

Loss function in FedLC

In the FedLC paper, the logit of the correct class does not appear in the denominator of the modified CE loss, and I think such a loss does not converge. Do you have the same impression? In your implementation, that term is in the denominator.

What's more, FedLC has the same idea as DMFL (IJCNN 2021).
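The concern about the denominator can be illustrated with plain cross-entropy, leaving out FedLC's calibration term entirely. This is a minimal sketch with hypothetical function names, not the FedLC loss itself. It compares a loss whose denominator includes the target logit with a variant that sums over the other classes only.

```python
import math


def ce_with_target(logits, y):
    """Standard cross-entropy: the target logit is part of the denominator."""
    log_denominator = math.log(sum(math.exp(z) for z in logits))
    return log_denominator - logits[y]


def ce_without_target(logits, y):
    """Variant whose denominator sums over the non-target classes only."""
    log_denominator = math.log(
        sum(math.exp(z) for c, z in enumerate(logits) if c != y)
    )
    return log_denominator - logits[y]


# Grow the target logit while the other two logits stay fixed at 0.
scales = (1.0, 5.0, 10.0, 20.0)
losses_with = [ce_with_target([s, 0.0, 0.0], 0) for s in scales]
losses_without = [ce_without_target([s, 0.0, 0.0], 0) for s in scales]
```

With the target in the denominator, the loss stays positive and approaches 0. Without it, the loss equals `log(2) - s` in this toy setup, so it keeps decreasing as the target logit grows and has no minimum. That is one way to read the convergence concern raised in the issue.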

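Question about the FedAP algorithm

Hi, I would like to ask what f-FedAP and d-FedAP refer to in your reproduction of the FedAP algorithm. Thank you very much! I hope to get your reply.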

Confusion about test_before and test_after

Hi karhouTam,

I am just confused about understanding the "after" and "before" of test_acc(before) and test_acc(after) in your code. Could you explain them a little bit? Thank you.

Dataset problem

Hello, if I want to run this framework on my own (image) dataset, what should I do?

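Problem with the run results

          > First of all, thank you for the support. ❤

Regarding perfedavg: it combines federated learning with meta-learning. Meta-learning is often criticized for being hard to train, and the original paper used an MLP as the model backbone, trained on mnist. The implication is that perfedavg's training behavior on non-convex models (such as CNNs) is not guaranteed.

So regarding your question, I'm sorry, but I cannot clearly explain why perfedavg's training curve is not smooth either. I can only guarantee that my code correctly reflects how perfedavg runs.

Hello, sorry to bother you again. When I run it, the loss is NaN. Is this normal? (A screenshot was attached.)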

Originally posted by @TigerAB1 in #30 (comment)

Data split of femnist dataset

Hello, thanks for your code.
I have a question about the data split of the femnist dataset.
I ran the following command and expected the dataset to be split into 10 parts.

./preprocess.sh -s niid --sf 1.0 -k 0 -t sample --iu 10

However, args.json still indicates that the client number is 36:

{"dataset": "femnist", "client_num": 36, "fraction": 0.5, "seed": 42, "split": "sample", "iid": true}

I don't know what I missed. Could you kindly explain? Thanks!

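About dataset partitioning

Thank you for sharing!
I would like to ask some questions about dataset partitioning:
When I run generate_data.py -d cifar10 -c 4 -cn 10, are 4 classes assigned to each client? Is the number of samples per class the same?

Looking forward to your reply.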

Are the running_var, running_mean, and num_batches_tracked keys trainable?

Hi, I recently found your wonderful FL-bench repository
and I have a question about your code structure.

I'm using Python 3.10 and torch 1.13.1.

In FL-bench/src/client/fedavg.py,

line 73: if not param.requires_grad

picks up the running_mean, running_var, and num_batches_tracked keys.

As far as I know, FedAvg only updates weight and bias, and keeps the batch-norm statistics (running_mean, running_var, num_batches_tracked) unchanged.

So I'm guessing your machine does not generate the running_mean, running_var, and num_batches_tracked keys.

Is this because my library versions do not match yours?
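As a point of reference for this question, here is a minimal sketch of how PyTorch itself classifies these keys. It uses a standalone `nn.BatchNorm2d` layer, not FL-bench's own code, and the variable names are illustrative. The running statistics are registered as buffers rather than parameters, so they appear in `state_dict()` with `requires_grad=False`.

```python
import torch
from torch import nn

bn = nn.BatchNorm2d(4)

# Learnable parameters: these have requires_grad=True and are seen by the optimizer.
param_names = [name for name, _ in bn.named_parameters()]

# Buffers: running statistics stored in state_dict but never updated by autograd.
buffer_names = [name for name, _ in bn.named_buffers()]

# With keep_vars=True the state_dict values are not detached, so the buffers
# are exactly the entries whose requires_grad is False.
frozen_keys = [
    name
    for name, tensor in bn.state_dict(keep_vars=True).items()
    if not tensor.requires_grad
]

print(param_names)   # parameters
print(buffer_names)  # buffers
print(frozen_keys)   # entries a `not requires_grad` check would select
```

Because this split is part of how the batch-norm module is defined, seeing these three keys is probably not a sign of a version mismatch. They are non-trainable state that a `requires_grad` check on the state dict will select.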

Bug report of FedMD code

The code of the FedMD algorithm does not work.
When I run the command:

cd src/server 
python fedmd.py

The following error occurs:

Traceback (most recent call last):
  File "/home/cjj/gitproject/FL-bench/src/server/fedmd.py", line 78, in <module>
    server = FedMDServer()
  File "/home/cjj/gitproject/FL-bench/src/server/fedmd.py", line 39, in __init__
    self.trainer = FedMDClient(
  File "/home/cjj/gitproject/FL-bench/src/client/fedmd.py", line 24, in __init__
    self.public_dataset = DATASETS[self.args.public_dataset](
TypeError: MNIST.__init__() got an unexpected keyword argument 'transform'

I think the bug may be in FedMDClient.__init__() in the file src/client/fedmd.py.

My environment

Python 3.10.10

Experiment Arguments:

{
'model': 'lenet5',
'dataset': 'cifar10',
'seed': 42,
'join_ratio': 0.1,
'global_epoch': 100,
'local_epoch': 5,
'finetune_epoch': 0,
'test_gap': 100,
'eval_test': 1,
'eval_train': 0,
'local_lr': 0.01,
'momentum': 0.0,
'weight_decay': 0.0,
'verbose_gap': 100000,
'batch_size': 32,
'visible': 0,
'global_testset': 0,
'straggler_ratio': 0,
'straggler_min_local_epoch': 1,
'use_cuda': 1,
'save_log': 1,
'save_model': 0,
'save_fig': 1,
'save_metrics': 1,
'digest_epoch': 1,
'public_dataset': 'mnist',
'public_batch_size': 32,
'public_batch_num': 5,
'dataset_args': {'dataset': 'cifar10', 'client_num': 100, 'fraction': 0.5, 'seed': 42, 'split': 'sample', 'alpha': 0.1, 'least_samples': 40}
}
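The TypeError reported above is the generic Python error for passing a keyword that a constructor does not declare. The sketch below reproduces it with a hypothetical `ToyDataset` class (not FL-bench's `MNIST`). It also shows one possible defensive pattern: filtering keyword arguments against the constructor's signature. The helper name `build_with_supported_kwargs` is an assumption made for illustration.

```python
import inspect


class ToyDataset:
    """Hypothetical dataset class whose constructor has no `transform` argument."""

    def __init__(self, root, train=True):
        self.root = root
        self.train = train


def build_with_supported_kwargs(cls, **kwargs):
    """Instantiate `cls`, dropping keyword arguments its __init__ does not declare."""
    params = inspect.signature(cls.__init__).parameters
    accepts_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_var_kwargs:
        return cls(**kwargs)
    supported = {k: v for k, v in kwargs.items() if k in params}
    return cls(**supported)


# Reproduce the error: the constructor rejects the unknown keyword.
try:
    ToyDataset(root="data", transform=None)
    raised = False
    message = ""
except TypeError as err:
    raised = True
    message = str(err)

# The filtered call succeeds because `transform` is dropped.
dataset = build_with_supported_kwargs(ToyDataset, root="data", transform=None)
```

Silently dropping arguments can hide real configuration mistakes. In practice it may be preferable to make the dataset class accept the argument, or to pass only the arguments each dataset declares.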

FedPer - Set split point of neural network at different locations

Hi there!

How can I set the split point at different locations in a neural network, e.g. ResNet18? For example, in their paper (see Figure 4a and 4b), they use a different number of layers in the classifier, and thus a different number of layers in the base.

Is this already implemented?

Best regards,
W
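One way to think about a configurable split point is as a partition of parameter names into a shared base and a personalized head. The sketch below is only an illustration under that assumption. The helper `split_by_head_prefixes` is hypothetical and not part of FL-bench, and the parameter names merely follow the naming style of torchvision's ResNet18.

```python
def split_by_head_prefixes(param_names, head_prefixes):
    """Partition parameter names into a shared base and a personalized head."""
    base, head = [], []
    for name in param_names:
        is_head = any(
            name == prefix or name.startswith(prefix + ".")
            for prefix in head_prefixes
        )
        (head if is_head else base).append(name)
    return base, head


# Illustrative parameter names in the style of a ResNet18 state_dict.
names = [
    "conv1.weight",
    "bn1.weight",
    "layer1.0.conv1.weight",
    "layer4.1.conv2.weight",
    "fc.weight",
    "fc.bias",
]

# Split A: only the final linear layer is personalized.
base_a, head_a = split_by_head_prefixes(names, ["fc"])

# Split B: the last residual stage is personalized as well.
base_b, head_b = split_by_head_prefixes(names, ["layer4", "fc"])
```

Moving the split point then amounts to changing the list of head prefixes. The parameters in `base` would be aggregated by the server, while the parameters in `head` stay on each client.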

Are the trainset and testset the same within a single client?

Hello, thanks for your code.
I have a question for my study: are the trainset and testset the same within a single client?
When my Dirichlet parameter alpha is 2, the accuracy is about 50% after 100 rounds, and the labels y look strange, e.g. tensor([7, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 7, 7, 7, 7]). Do you think there are too many 7s?
Please let me know what you think soon; it is very important to me. Thank you very much.

TypeError: 'type' object is not subscriptable

C:\Users\20834\anaconda3\envs\ChatGLM-6B-new\python.exe F:\FL-bench\data\generate_data.py -d cifar10 -a 0.1 -cn 100
Traceback (most recent call last):
  File "F:\FL-bench\data\generate_data.py", line 16, in <module>
    from utils.schemes import (
  File "F:\FL-bench\data\utils\schemes\__init__.py", line 5, in <module>
    from .semantic import semantic_partition
  File "F:\FL-bench\data\utils\schemes\semantic.py", line 22, in <module>
    from src.config.utils import get_best_device
  File "F:\FL-bench\src\config\utils.py", line 69, in <module>
    src: Union[OrderedDict[str, torch.Tensor], torch.nn.Module],
TypeError: 'type' object is not subscriptable

\FL-bench\src\config\utils.py

    def trainable_params(
        src: Union[OrderedDict[str, torch.Tensor], torch.nn.Module],
        detach=False,
        requires_name=False,
    ) -> Union[List[torch.Tensor], Tuple[List[torch.Tensor], List[str]]]:

It seems like you're encountering a TypeError, which says 'type' object is not subscriptable. This error is most often raised when you are trying to treat a class or a type like a list or dictionary.

Your error seems to be related to this line:

src: Union[OrderedDict[str, torch.Tensor], torch.nn.Module],

The problem here is that the annotation subscripts OrderedDict directly. If that name refers to collections.OrderedDict, subscripting it only works on Python 3.9+ (PEP 585), where built-in and standard-library collection types such as list, tuple, dict, and collections.OrderedDict can be used directly as generic types in type hints.

However, if you're using a version of Python prior to 3.9, you'll need to import the generic versions of these types from the typing module (typing.OrderedDict is available from Python 3.7.2). For Python 3.8 or below, your code should look something like:

from typing import Union, List, Tuple, OrderedDict
from torch import Tensor

src: Union[OrderedDict[str, Tensor], torch.nn.Module],

Please verify the Python version you're using. If you're using Python 3.8 or below, consider upgrading to Python 3.9+ to take advantage of the PEP 585 features. Otherwise, you'll need to import OrderedDict from typing.

To fix the TypeError: 'type' object is not subscriptable, add the following import to the file \FL-bench\src\config\utils.py:

from typing import Union, List, Tuple, OrderedDict
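The import fix above can be checked in isolation. The following is a minimal sketch in which floats stand in for tensors so that it runs without torch; the function `describe` is hypothetical. It shows that the `typing.OrderedDict` annotation works, and it probes whether the running interpreter allows subscripting `collections.OrderedDict` directly.

```python
import collections
import sys
from typing import Dict, List, OrderedDict, Union


def describe(src: Union[OrderedDict[str, float], List[float]]) -> Dict[str, int]:
    """Toy function; floats stand in for tensors so the block runs without torch."""
    return {"size": len(src)}


# The annotation alias comes from typing; the runtime object is still
# an ordinary collections.OrderedDict.
weights = collections.OrderedDict(a=1.0, b=2.0)
result = describe(weights)

# Subscripting the runtime class itself is what fails on older interpreters.
try:
    collections.OrderedDict[str, float]
    runtime_subscript_ok = True
except TypeError:
    runtime_subscript_ok = False

print(sys.version_info[:2], runtime_subscript_ok)
```

On Python 3.9 and later both spellings work. On earlier versions only the `typing.OrderedDict` annotation does, which matches the traceback in this issue.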

std

Could you please explain how the standard deviation is calculated in your pfedsim paper? The standard deviation I obtained from running your codebase is quite different from yours.

Pretrained model loading

Many thanks for your contribution to the FL community. I have really benefited a lot.

When I wanted to load a pretrained model, I couldn't find a universal or easy way to do it (e.g., via args).

After checking the code, I believe I should change the first argument of trainable_params to my desired checkpoint.

        self.model = use_model.to(self.device)
        
        # FIXME: using pre-trained models
        init_trainable_params, self.trainable_params_name = trainable_params(
            self.model, detach=True, requires_name=True
        )

Am I correct? Thank you very much for your reply!

Question

How does your codebase implement testing for each category of domain? When I run it, I only get one result for the domain.

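Problems with dataset generation and running the example demo

Thank you very much for this federated learning algorithm repository. However, I ran into a problem when running it and would like to ask for help.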

python generate_data.py -d cifar10 -a 0.1 -cn 100

Traceback (most recent call last):
  File "generate_data.py", line 16, in <module>
    from utils.schemes import (
  File "/root/FL-bench-master/data/utils/schemes/__init__.py", line 5, in <module>
    from .semantic import semantic_partition
  File "/root/FL-bench-master/data/utils/schemes/semantic.py", line 22, in <module>
    from src.config.utils import get_best_device
  File "/root/FL-bench-master/src/config/utils.py", line 69, in <module>
    src: Union[OrderedDict[str, torch.Tensor], torch.nn.Module],
TypeError: 'type' object is not subscriptable

This error occurs when generating the dataset, and I have not found a solution yet.

Question of the output and the test accuracy

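Hello, thank you very much for this very complete and convenient FL framework. I have a few small questions about the output and would appreciate it if you could take a look when you have time. Thank you very much for your help.

I trained resnet-18 on tiny-imagenet with the following command: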
python src/server/fedavg.py -m res18 -d tiny_imagenet -jr 1.0 -ge 20 -le 10 -bs 64 -lr 0.01 -mom 0.9 -wd 0.00001 -v 1 -vg 10

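I got the following results (attached as a screenshot).

(1) Are the results at the top the accuracies obtained with the models that have not been aggregated (i.e. after local training)?
(2) Are the numbers under "convergence" at the bottom the accuracies obtained with the global model?
(3) Specifically, how should I use your code so that it evaluates the global model's accuracy after every training round? Should I set --test_gap to 1?

Thank you very much for your code and your answers. The code is well written, but my ability is limited and I got a bit lost reading it, so the points above are still unclear to me.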

Evaluation in test phase

In the test phase, it seems that there is only average of local test on clients, but no global test on the server?

Problem running the setup commands

When I run

sed -i "10,14d" pyproject.toml && poetry lock --no-update && poetry install

it stops at step 13/15 with a TypeError.

I also tried

docker build \
    -t fl-bench \
    --build-arg IMAGE_SOURCE=karhou/ubuntu:basic \
    --build-arg CHINA_MAINLAND=false \
    .

which fails with:
Hash for nvidia-cublas-cu12 (12.1.3.1) from archive nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl not found in known hashes (was: sha256:98d15fd621af39d255ca783f5e7b7f17d3f25a3e639a307944576aa17b30cc51)

at /usr/local/lib/python3.10/dist-packages/poetry/installation/executor.py:799 in _validate_archive_hash
795│ archive_hash: str = "sha256:" + get_file_hash(archive)
796│ known_hashes = {f["hash"] for f in package.files if f["file"] == archive.name}
797│
798│ if archive_hash not in known_hashes:
→ 799│ raise RuntimeError(
800│ f"Hash for {package} from archive {archive.name} not found in"
801│ f" known hashes (was: {archive_hash})"
802│ )
803│

Cannot install nvidia-cublas-cu12.

• Installing nvidia-cusparse-cu12 (12.1.0.106)
The command '/bin/sh -c poetry install' returned a non-zero code: 1

This also happens at step 13/15.

Is it possible to run the main code despite these problems?

Pretrained model evaluation

First of all, thank you very much for the project! I think a lot of researchers/practitioners can benefit from it! It is a huge contribution to the area!

I have a question related to the evaluation of the pretrained model.
I have trained an algorithm (e.g. FedRep) with standard hyperparams on cifar10 and saved the resulting model.

Next, following the instructions, I provided a path via the --external_model_params_file parameter (I just hardcoded the appropriate value as the default path), and everything loaded successfully.

But it is not clear to me how to evaluate this model.

I did the following:

from src.server.fedrep import FedRepServer

model: FedRepServer = FedRepServer() # path to the weights hardcoded

model.test()
out = model.test_results

print(out)

However, the results I get are far worse than the ones in the cifar10_log.html file:

When training ended:
{100: {'loss': '0.6021 -> 0.0000', 'accuracy': '74.69% -> 0.00%'}}

When I loaded:
{1: {'loss': '98.6518 -> 0.0000', 'accuracy': '18.83% -> 0.00%'}}

I am pretty sure I am missing something. Could you elaborate on this, please?

I also have a question more specific to FedRep. Shouldn't unique_model=True be set here, since each client keeps its own version of the head from the previous round?

Thank you!

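Custom dataset partitioning

Hello,
I would like to partition the data classes by client ID in a custom way. For example, clients 1 and 2 only have data of class A, clients 3 and 4 only have data of class B, and so on. Do you have any suggestions for this requirement? (Which part of the code should I start modifying?)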
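The partition requested above can be sketched independently of any framework. The following is a minimal example; the function `partition_by_client_classes` and its arguments are hypothetical and not part of FL-bench's data utilities. It takes a mapping from client ID to allowed classes and deals each class's sample indices round-robin to the clients that own that class.

```python
import random
from collections import defaultdict


def partition_by_client_classes(labels, client_classes, seed=42):
    """Assign each sample index to one of the clients allowed to hold its class."""
    # Invert the mapping: class -> clients that may hold it.
    class_to_clients = defaultdict(list)
    for client_id, classes in client_classes.items():
        for c in classes:
            class_to_clients[c].append(client_id)

    # Group sample indices by class.
    class_to_indices = defaultdict(list)
    for idx, label in enumerate(labels):
        class_to_indices[label].append(idx)

    rng = random.Random(seed)
    partition = {client_id: [] for client_id in client_classes}
    for c, indices in class_to_indices.items():
        owners = class_to_clients.get(c, [])
        if not owners:
            continue  # no client is configured for this class
        rng.shuffle(indices)
        # Deal the shuffled indices round-robin to the owning clients.
        for position, idx in enumerate(indices):
            partition[owners[position % len(owners)]].append(idx)
    return partition


# Toy labels: class 0 plays the role of "A", class 1 the role of "B".
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
client_classes = {1: [0], 2: [0], 3: [1], 4: [1]}
partition = partition_by_client_classes(labels, client_classes)
```

In this toy run, clients 1 and 2 share the class-0 samples, clients 3 and 4 share the class-1 samples, and the class-2 samples stay unassigned because no client lists that class.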

Data generation problem

Hello, I would like to ask why it does not work when I set cn to 1000.

Hello, when I change 'split' to 'user', it can't run anymore

Thanks for your code.
I have a question: when I change 'split' from 'sample' to 'user', the code no longer runs.
I think this may be because clients have no test data when 'split' is in 'user' mode, which leads to the error below.

File "/sftpFile/src/client/fedavg.py", line 112, in train_and_log
    loss_before / num_samples,
ZeroDivisionError: division by zero

So my first question is: how do I run in 'user' mode?

I also have another question.
I noticed that in the 'sample' split mode, the code uses the test data and the train data of the same client. Does that work? The accuracy I measured is 8-90%, which is higher than centralized training. Is it overfitting? I would also like to know how to evaluate the accuracy: from the chart, or from the test result?

Thank you very much!

The chart and the test result were attached as images.
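The ZeroDivisionError in this issue comes from averaging a metric over zero samples. Below is a minimal sketch of one way to guard such a division. The helpers `average_or_none` and `format_metric` are hypothetical names, not FL-bench functions.

```python
def average_or_none(total, num_samples):
    """Return total / num_samples, or None when there are no samples."""
    if num_samples == 0:
        return None
    return total / num_samples


def format_metric(value):
    """Render a metric for logging, marking missing values explicitly."""
    return "n/a" if value is None else f"{value:.4f}"


# A client without evaluation data: nothing was accumulated.
empty_client = format_metric(average_or_none(0.0, 0))

# A client with 32 evaluation samples and a summed loss of 19.2.
normal_client = format_metric(average_or_none(19.2, 32))
```

Reporting a missing value explicitly keeps the run alive. Whether a client should have evaluation data at all in 'user' mode is a separate question for the maintainer.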

Question about finetune

Hello, first of all thank you very much for your open-source spirit and contribution. While studying your code, I found that if I generate the data with --split user, then in

    def finetune(self):
        """
        The fine-tune function. If your method has different fine-tuning operation, consider to override this.
        This function will only be activated while in FL test round.
        """
        self.model.train()
        for _ in range(self.args.finetune_epoch):
            for x, y in self.trainloader:
                if len(x) <= 1:
                    continue

                x, y = x.to(self.device), y.to(self.device)
                logit = self.model(x)
                loss = self.criterion(logit, y)
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()

since testing iterates with for client_id in self.test_clients:, wouldn't self.trainloader be empty? I am not sure whether I have misunderstood something. I would be very grateful if you could take the time to answer my question.

cluster aggregation issue

def aggregate_clusterwise(self):

In this function, the deltas of all clients in self.delta_list are used for cluster aggregation, including clients that are not participating in the current round.

I don't think this matches the meaning of the original paper. Please correct me if my understanding is wrong.
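The alternative described in this issue can be sketched as follows. It restricts each cluster's average to the clients selected in the current round. This is a minimal illustration of that reading with plain lists standing in for tensors. The function signature and argument names are assumptions and do not mirror FL-bench's actual `aggregate_clusterwise`.

```python
def aggregate_clusterwise(clusters, delta_by_client, selected_clients):
    """Average deltas per cluster, using only clients selected this round."""
    selected = set(selected_clients)
    cluster_updates = []
    for cluster in clusters:
        members = [cid for cid in cluster if cid in selected]
        if not members:
            # No participant from this cluster: leave its model unchanged.
            cluster_updates.append(None)
            continue
        dim = len(delta_by_client[members[0]])
        mean_delta = [
            sum(delta_by_client[cid][j] for cid in members) / len(members)
            for j in range(dim)
        ]
        cluster_updates.append(mean_delta)
    return cluster_updates


# Toy setup: two clusters; client 2 holds a stale delta from an earlier round.
clusters = [[0, 1, 2], [3, 4]]
delta_by_client = {
    0: [1.0, 1.0],
    1: [3.0, 5.0],
    2: [100.0, 100.0],
    3: [2.0, 2.0],
    4: [4.0, 4.0],
}
updates = aggregate_clusterwise(clusters, delta_by_client, selected_clients=[0, 1])
```

In this example the stale delta of client 2 is ignored, and the second cluster receives no update because none of its clients participated. Whether this matches the original paper is exactly the question the issue raises.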

python generate_data.py -d medmnistC -a 0.1 -cn 100

When I use the command python generate_data.py -d medmnistC -a 0.1 -cn 100, it takes a long time to execute and seems to fail because I previously used python generate_data.py -d medmnistC -a 0.5 -cn 100. Do you know how to resolve this issue?

Opinion about dataset split

Hi, KarhouTam.
I recently discussed dataset splitting for traditional FL (FedAvg, FedProx, FedDyn, etc.) with my colleague.
My colleague insisted that, in order to evaluate an FL algorithm, I should evaluate the model on an isolated dataset
(e.g. for MNIST, first set aside 6000~ images as a test dataset for global model evaluation, and then assign the rest of the images to the clients for test and eval).
As far as I know, the global server can't see the entire dataset because of privacy concerns, right?
I think that's why your dataset creation setup also doesn't assign a test dataset for global model evaluation.

What is your opinion on this?

About the convergence of FedLC

FedLC's training on CIFAR10/100 tends to break down after a random epoch, and the epoch at which it breaks down differs across random seeds. The problem doesn't occur when training on TinyImageNet. I would appreciate your help.

About SCAFFOLD

Thanks for your code! I would like to know why the SCAFFOLD curve on emnist behaves like this. (The curve was attached as an image.)

For pFedMe

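Thank you for your open-source code. May I ask: in your pFedMe code, clients are sampled first and then trained, whereas the original pFedMe source code trains all clients first and then samples some of them for aggregation. I don't understand why the original code and paper are designed this way, since it trains many clients whose results go unused and increases the training time. The whole point of sampling first and training afterwards in FedAvg is to speed up training. I am not sure whether my understanding is correct.
I would also like to ask whether sampling first and training afterwards leaves the performance of pFedAvg unaffected.
(An image was attached.)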
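The cost difference between the two orderings discussed above can be made concrete by counting how many clients run local training per round. This is a minimal sketch with hypothetical function names. It models only the bookkeeping, not pFedMe or FedAvg themselves.

```python
import random


def sample_then_train(clients, join_ratio, rng):
    """Sample first: only the sampled clients run local training."""
    selected = rng.sample(clients, max(1, int(len(clients) * join_ratio)))
    trained = list(selected)
    return trained, selected


def train_then_sample(clients, join_ratio, rng):
    """Train first: every client runs local training, a subset is aggregated."""
    trained = list(clients)
    selected = rng.sample(clients, max(1, int(len(clients) * join_ratio)))
    return trained, selected


clients = list(range(8))
rng = random.Random(0)
trained_a, aggregated_a = sample_then_train(clients, 0.25, rng)
trained_b, aggregated_b = train_then_sample(clients, 0.25, rng)
```

Both orderings aggregate the same number of clients (2 here), but the second one runs local training on all 8 clients per round. That extra local training is the cost the question refers to.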

Applying the benchmark to a segmentation task

Hi KarhouTam.

I'd like to apply your FL benchmark setup to my segmentation task,

but I think structuring a segmentation model to fit your setup is a little tricky.

As I understand it, your framework requires the model to be organized into base layers, classifier layers, and so on.

Is there any advice for setting up a segmentation model?

Can somebody please help me solve this problem?

(base) alami@alami-Latitude-7390:~/Téléchargements/FL-bench-master$ sed -i "26,30d" pyproject.toml && poetry lock --no-update && poetry install

RuntimeError

The Poetry configuration is invalid:
- [readme] ['README.md', 'data/README.md'] is not of type 'string'

at /usr/lib/python3/dist-packages/poetry/core/factory.py:43 in create_poetry
39│ message = ""
40│ for error in check_result["errors"]:
41│ message += " - {}\n".format(error)
42│
→ 43│ raise RuntimeError("The Poetry configuration is invalid:\n" + message)
44│
45│ # Load package
46│ name = local_config["name"]
47│ version = local_config["version"]

Runtime error

I want to specify the model as MobileNetV2 using your code, but I keep getting an error: 'RuntimeError: size mismatch (got input: [10], target: [32]).' Do you know what's happening?
