karhoutam / fl-bench
Benchmark of federated learning. Dedicated to the community. 🤗
License: MIT License
I trained the FedAvg CNN defined in your repository on the CIFAR-10 dataset. The final accuracy only converges to about 0.8, which is quite different from what the authors published in their paper. Have you tried this experiment?
The following is the result in the paper. It takes me more than 500 rounds to converge to 0.8, and the accuracy does not improve after that.
https://github.com/KarhouTam/FL-bench/blob/54d65a7d91bcf255a16381e103102142d34a72d8/src/server/ccvr.py#L71C1-L73C14
According to the original formula:
The last term, 'labels_count[c] - 1', should be wrapped in parentheses:
classes_cov[c] -= labels_count[c] / (labels_count[c] - 1) * (
    classes_mean[c].unsqueeze(1) @ classes_mean[c].unsqueeze(0)
)
Please correct this.
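The effect of the parentheses can be checked numerically in one dimension. This is a minimal sketch in plain Python rather than the repository's torch code, with made-up sample values; it assumes the formula has the usual form "second moment minus N/(N-1) times the squared mean" and compares the result against statistics.variance:

```python
import statistics

# Made-up 1-D "features" for a single class c.
z = [1.0, 2.0, 4.0, 7.0]
n = len(z)
mean = sum(z) / n

# First term: sum of squares divided by (n - 1).
second_moment = sum(v * v for v in z) / (n - 1)

# With parentheses: n / (n - 1) scales the squared mean, as intended.
var_correct = second_moment - n / (n - 1) * mean * mean

# Without parentheses, `n / n - 1 * mean * mean` parses as
# (n / n) - (1 * mean * mean), which is a different quantity.
var_wrong = second_moment - (n / n - 1 * mean * mean)

reference = statistics.variance(z)  # unbiased sample variance
print(var_correct, var_wrong, reference)
```

Only the parenthesized form agrees with the unbiased sample variance.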
Second, in the same file, there is a problem caused by classes that have no data:
def generate_virtual_representation(
    self, classes_mean: List[torch.Tensor], classes_cov: List[torch.Tensor]
):
    data, targets = [], []
    for c, (mean, cov) in enumerate(zip(classes_mean, classes_cov)):
I recommend adding the following check to avoid a collapse:
def generate_virtual_representation(
    self, classes_mean: List[torch.Tensor], classes_cov: List[torch.Tensor]
):
    data, targets = [], []
    for c, (mean, cov) in enumerate(zip(classes_mean, classes_cov)):
        # Skip classes whose statistics are all NaN (no samples for that class).
        if torch.all(torch.isnan(mean)) or torch.all(torch.isnan(cov)):
            continue
In FedLC, the logit of the correct class does not appear in the denominator of the modified CE loss function, and I think that version does not converge. Do you have the same impression? In your implementation, that term is in the denominator.
What's more, FedLC has the same idea as DMFL (IJCNN 2021).
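To illustrate why the correct-class term matters, here is a minimal sketch in plain Python. It leaves out FedLC's calibration margin entirely and uses made-up logits, so it only shows the general effect of dropping that term from a softmax cross-entropy, not either implementation:

```python
import math

def ce_with_true_class(logits, y):
    """Standard softmax cross-entropy: the denominator sums over every class."""
    denom = sum(math.exp(v) for v in logits)
    return -math.log(math.exp(logits[y]) / denom)

def ce_without_true_class(logits, y):
    """Variant whose denominator skips the true class y."""
    denom = sum(math.exp(v) for i, v in enumerate(logits) if i != y)
    return -math.log(math.exp(logits[y]) / denom)

# Grow the true-class logit while the other logits stay at zero.
for f_y in (0.0, 5.0, 20.0):
    logits = [f_y, 0.0, 0.0]
    print(f_y, ce_with_true_class(logits, 0), ce_without_true_class(logits, 0))
```

With the true class in the denominator the loss stays non-negative and approaches zero. Without it, the loss keeps decreasing as the true-class logit grows, so it has no lower bound, which may be one reason training becomes unstable.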
Hello, could you tell me what f-FedAP and d-FedAP refer to in your reproduction of the FedAP algorithm? Thank you very much! I hope to hear back from you.
Hi karhouTam,
I am confused about the meaning of "before" and "after" in test_acc(before) and test_acc(after) in your code. Could you explain them a little bit? Thank you.
Hello, if I want to run this framework on my own dataset(image), what should I do?
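> First of all, thanks for the support. ❤
About perfedavg: perfedavg combines federated learning with meta-learning. A frequent criticism of meta-learning is that it is hard to train, and the original paper used an MLP as the model backbone, trained on mnist. The implication is that perfedavg's training behavior on non-convex models (such as CNNs) is not actually guaranteed.
So, regarding your question, I'm sorry, but I cannot clearly explain why perfedavg's training curve is not smooth. I can only guarantee that my code correctly reflects how perfedavg runs.
Hello, sorry to bother you again. When I run it, the loss is nan. Is this normal?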
Originally posted by @TigerAB1 in #30 (comment)
Hello, thanks for your code.
I have a question about the data split of the femnist dataset.
I ran the following command and expected the dataset to be split into 10 parts.
./preprocess.sh -s niid --sf 1.0 -k 0 -t sample --iu 10
However, args.json still indicates that the client number is 36:
{"dataset": "femnist", "client_num": 36, "fraction": 0.5, "seed": 42, "split": "sample", "iid": true}
I don't know what I missed. Could you kindly explain? Thanks!
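Thank you for sharing!
I have some questions about how the dataset is partitioned:
When I run generate_data.py -d cifar10 -c 4 -cn 10, is each client assigned 4 classes? Does each class have the same number of samples?
Looking forward to your reply.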
Hi, I recently found your wonderful FL-bench repository.
I have a question about your code structure.
I'm using Python 3.10 and torch 1.13.1.
In FL-bench/src/client/fedavg.py,
the check on line 73, if not param.requires_grad,
produces the running_mean, running_var, and num_batches_tracked keys.
As far as I know, FedAvg only updates weight and bias and keeps the batch-norm statistics (the running_mean, running_var, and num_batches_tracked keys) unchanged.
So I'm guessing your machine does not generate the running_mean, running_var, and num_batches_tracked keys.
Is this because my library versions differ from yours?
Hi. I have changed the "finetune_epoch" parameter in "fedrep", "fedper", and some other algorithms to fine-tune the classifier layer, but it seems to have no effect on test accuracy.
The code of the FedMD algorithm does not work.
When I run the command:
cd src/server
python fedmd.py
the following error occurs:
Traceback (most recent call last):
  File "/home/cjj/gitproject/FL-bench/src/server/fedmd.py", line 78, in <module>
    server = FedMDServer()
  File "/home/cjj/gitproject/FL-bench/src/server/fedmd.py", line 39, in __init__
    self.trainer = FedMDClient(
  File "/home/cjj/gitproject/FL-bench/src/client/fedmd.py", line 24, in __init__
    self.public_dataset = DATASETS[self.args.public_dataset](
TypeError: MNIST.__init__() got an unexpected keyword argument 'transform'
I think the bug may be in FedMDClient.__init__() in the file src/client/fedmd.py.
Python 3.10.10
{
'model': 'lenet5',
'dataset': 'cifar10',
'seed': 42,
'join_ratio': 0.1,
'global_epoch': 100,
'local_epoch': 5,
'finetune_epoch': 0,
'test_gap': 100,
'eval_test': 1,
'eval_train': 0,
'local_lr': 0.01,
'momentum': 0.0,
'weight_decay': 0.0,
'verbose_gap': 100000,
'batch_size': 32,
'visible': 0,
'global_testset': 0,
'straggler_ratio': 0,
'straggler_min_local_epoch': 1,
'use_cuda': 1,
'save_log': 1,
'save_model': 0,
'save_fig': 1,
'save_metrics': 1,
'digest_epoch': 1,
'public_dataset': 'mnist',
'public_batch_size': 32,
'public_batch_num': 5,
'dataset_args': {'dataset': 'cifar10', 'client_num': 100, 'fraction': 0.5, 'seed': 42, 'split': 'sample', 'alpha': 0.1, 'least_samples': 40}
}
Hi there!
How can I set the split point at different locations in a neural network, e.g. ResNet18? For example, in their paper (see Figure 4a and 4b), they use a different number of layers in the classifier and thus a different number of layers in the base.
Is this already implemented?
Best regards,
W
Hello, thanks for your code.
I have a question for my study: are the trainset and testset the same within a single client?
When my Dirichlet parameter alpha is 2, the accuracy is about 50% after 100 rounds, and the labels y look strange, e.g. tensor([7, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 6, 6, 7, 7, 7, 7]). Do you think there are too many 7s?
Please let me know what you think soon; it is very important to me. Thank you very much.
hey,
Thank you for providing this repo, really helpful.
Would be nice if you add the below paper in the near future :)
C:\Users\20834\anaconda3\envs\ChatGLM-6B-new\python.exe F:\FL-bench\data\generate_data.py -d cifar10 -a 0.1 -cn 100
Traceback (most recent call last):
  File "F:\FL-bench\data\generate_data.py", line 16, in <module>
    from utils.schemes import (
  File "F:\FL-bench\data\utils\schemes\__init__.py", line 5, in <module>
    from .semantic import semantic_partition
  File "F:\FL-bench\data\utils\schemes\semantic.py", line 22, in <module>
    from src.config.utils import get_best_device
  File "F:\FL-bench\src\config\utils.py", line 69, in <module>
    src: Union[OrderedDict[str, torch.Tensor], torch.nn.Module],
TypeError: 'type' object is not subscriptable
def trainable_params(
    src: Union[OrderedDict[str, torch.Tensor], torch.nn.Module],
    detach=False,
    requires_name=False,
) -> Union[List[torch.Tensor], Tuple[List[torch.Tensor], List[str]]]:
It seems you're encountering a TypeError that says 'type' object is not subscriptable. This error is most often raised when you subscript a class or type that does not support it, as if it were a list or dictionary.
Your error seems to be related to this line:
src: Union[OrderedDict[str, torch.Tensor], torch.nn.Module],
Subscripting standard-library classes like this follows PEP 585, which is only supported from Python 3.9 onward. There, built-in types like list, tuple, and dict, as well as collections.OrderedDict, can be used directly for type hinting.
However, if you're using a version of Python prior to 3.9, collections.OrderedDict cannot be subscripted, and you'll need to import the generic types from the typing module (typing.OrderedDict is available from Python 3.7.2). For Python 3.8 or below, your code should look something like:
from typing import Union, List, Tuple, OrderedDict
import torch
from torch import Tensor
src: Union[OrderedDict[str, Tensor], torch.nn.Module],
Please verify the Python version you're using. If you're using Python 3.8 or below, consider upgrading to 3.9+ to take advantage of the PEP 585 features. Otherwise, you'll need to import OrderedDict from typing:
from typing import Union, List, Tuple, OrderedDict
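A minimal, torch-free sketch of the same annotation pattern follows. The function param_names and the float values are hypothetical stand-ins for trainable_params and tensors; the point is only that typing.OrderedDict can be subscripted in an annotation while the runtime object is still a collections.OrderedDict:

```python
import collections
from typing import List, OrderedDict, Union

# typing.OrderedDict is a generic alias, so subscripting it in an annotation
# works on Python 3.7.2+, where collections.OrderedDict[...] would raise
# "TypeError: 'type' object is not subscriptable" before Python 3.9.
def param_names(src: Union[OrderedDict[str, float], List[str]]) -> List[str]:
    """Hypothetical stand-in for trainable_params: return the entry names."""
    if isinstance(src, collections.OrderedDict):
        return list(src.keys())
    return list(src)

state = collections.OrderedDict([("conv.weight", 0.1), ("conv.bias", 0.0)])
names = param_names(state)
print(names)
```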
Could you please explain how the standard deviation is calculated in your pfedsim paper? The standard deviation I obtained from running your codebase is quite different from yours
Many thanks for your contribution to the FL community. I have really benefited a lot.
When I wanted to load a pretrained model, I could not find a universal or easy way to do it (e.g., via args).
After checking the code, I believe I should change the first argument of trainable_params to my desired checkpoint.
self.model = use_model.to(self.device)
# FIXME: using pre-trained models
init_trainable_params, self.trainable_params_name = trainable_params(
    self.model, detach=True, requires_name=True
)
Am I correct? Thank you very much for your reply!
"How does your codebase implement testing for each category of domain? When I run it, I only get one result for the domain."
Thank you very much for this federated learning algorithm repository. However, I ran into a problem when running it and would like to ask for help.
python generate_data.py -d cifar10 -a 0.1 -cn 100
Traceback (most recent call last):
  File "generate_data.py", line 16, in <module>
    from utils.schemes import (
  File "/root/FL-bench-master/data/utils/schemes/__init__.py", line 5, in <module>
    from .semantic import semantic_partition
  File "/root/FL-bench-master/data/utils/schemes/semantic.py", line 22, in <module>
    from src.config.utils import get_best_device
  File "/root/FL-bench-master/src/config/utils.py", line 69, in <module>
    src: Union[OrderedDict[str, torch.Tensor], torch.nn.Module],
TypeError: 'type' object is not subscriptable
This error appears when generating the dataset, and I have not found a solution yet.
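Hello, thank you very much for this very complete and convenient FL framework. I have a small question about the output; if you have time, I hope you can take a look. Thank you very much for your help.
I am training resnet-18 on tiny-imagenet with the following command:
python src/server/fedavg.py -m res18 -d tiny_imagenet -jr 1.0 -ge 20 -le 10 -bs 64 -lr 0.01 -mom 0.9 -wd 0.00001 -v 1 -vg 10
(1) Are the results at the top the accuracy obtained with the models before aggregation (after local training)?
(2) Is the number under "convergence" at the bottom the accuracy obtained with the global model?
(3) Specifically, how should I use your code so that it checks the global model's accuracy after every training round? Should I set --test_gap to 1?
Thank you very much for your code and your answers. The code is well written, but my ability is limited and I get a bit lost reading it, so I am still unclear about the points above.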
In the test phase, it seems that there is only average of local test on clients, but no global test on the server?
When I run
sed -i "10,14d" pyproject.toml && poetry lock --no-update && poetry install
it stops at step 13/15 with a TypeError.
The same happens with
docker build \
  -t fl-bench \
  --build-arg IMAGE_SOURCE=karhou/ubuntu:basic \
  --build-arg CHINA_MAINLAND=false \
  .
Hash for nvidia-cublas-cu12 (12.1.3.1) from archive nvidia_cublas_cu12-12.1.3.1-py3-none-manylinux1_x86_64.whl not found in known hashes (was: sha256:98d15fd621af39d255ca783f5e7b7f17d3f25a3e639a307944576aa17b30cc51)
at /usr/local/lib/python3.10/dist-packages/poetry/installation/executor.py:799 in _validate_archive_hash
795│ archive_hash: str = "sha256:" + get_file_hash(archive)
796│ known_hashes = {f["hash"] for f in package.files if f["file"] == archive.name}
797│
798│ if archive_hash not in known_hashes:
→ 799│ raise RuntimeError(
800│ f"Hash for {package} from archive {archive.name} not found in"
801│ f" known hashes (was: {archive_hash})"
802│ )
803│
Cannot install nvidia-cublas-cu12.
• Installing nvidia-cusparse-cu12 (12.1.0.106)
The command '/bin/sh -c poetry install' returned a non-zero code: 1
This fails at the same step, 13/15.
Is it possible to run the main code despite these problems?
Traceback (most recent call last):
  File "/data/xiagroup/FL-bench-master/data/utils/run.py", line 16, in <module>
    _CURRENT_DIR = Path(__file__).parent.abspath()
AttributeError: 'PosixPath' object has no attribute 'abspath'
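The error message indicates that Path here is pathlib's, which has no abspath() method. A minimal sketch of the pathlib equivalent, using a relative example path as a stand-in for Path(__file__); whether the repository intended pathlib or another Path class is an assumption:

```python
from pathlib import Path

# Stand-in for Path(__file__) so the snippet runs anywhere.
script = Path("data/utils/run.py")

# pathlib paths have no .abspath(); .absolute() is the counterpart of
# os.path.abspath (.resolve() additionally follows symlinks).
current_dir = script.parent.absolute()
print(current_dir)
```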
First of all, thank you very much for the project! I think a lot of researchers/practitioners can benefit from it! It is a huge contribution to the area!
I have a question related to the evaluation of the pretrained model.
I have trained an algorithm (e.g. FedRep) with standard hyperparams on cifar10 and saved the resulting model.
Next, following the instructions, I provided a path to the --external_model_params_file parameter (I just hardcoded the appropriate value as the default path) and everything loaded successfully.
But for me, it is not clear how to evaluate this model.
I did the following:
from src.server.fedrep import FedRepServer
model: FedRepServer = FedRepServer() # path to the weights hardcoded
model.test()
out = model.test_results
print(out)
and the results I get are far worse than the ones in the cifar10_log.html file:
When training ended:
{100: {'loss': '0.6021 -> 0.0000', 'accuracy': '74.69% -> 0.00%'}}
When I loaded:
{1: {'loss': '98.6518 -> 0.0000', 'accuracy': '18.83% -> 0.00%'}}
I am pretty sure I am missing something. Can you elaborate on this, please?
And also a question more specific to FedRep. Shouldn't we have here unique_model=True, as each of the clients keeps the version of the head from the previous round?
Thank you!
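Hello,
I would like to customize which data classes each client receives, by client ID. For example, clients 1 and 2 only have class A, clients 3 and 4 only have class B, and so on. Do you have any suggestions for this? (Which part of the code should I start modifying?)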
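One possible starting point, independent of FL-bench's own partition code: a minimal sketch that maps each client to the classes it may hold and deals the samples of each class round-robin to the clients that own it. The function name, its arguments, and the toy labels are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def partition_by_client_classes(
    targets: Sequence[int], client_classes: Dict[int, List[int]]
) -> Dict[int, List[int]]:
    """Split sample indices so each client only receives its listed classes."""
    owners: Dict[int, List[int]] = defaultdict(list)  # class -> owning clients
    for client, classes in client_classes.items():
        for c in classes:
            owners[c].append(client)

    assigned: Dict[int, List[int]] = {client: [] for client in client_classes}
    seen: Dict[int, int] = defaultdict(int)  # class -> samples dealt so far
    for idx, label in enumerate(targets):
        if label not in owners:
            continue  # no client is configured to hold this class
        client = owners[label][seen[label] % len(owners[label])]
        assigned[client].append(idx)
        seen[label] += 1
    return assigned

# Clients 0 and 1 hold only class 0 ("A"); clients 2 and 3 hold only class 1 ("B").
targets = [0, 1, 0, 1, 0, 1, 2]
parts = partition_by_client_classes(targets, {0: [0], 1: [0], 2: [1], 3: [1]})
print(parts)
```

The returned index lists could then be used to build per-client subsets of the dataset.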
Hello, why does it not work when I set cn to 1000?
Thanks for your code.
I have a question: when I change 'split' from 'sample' to 'user', it can't run.
I think this may be because clients have no test data in 'user' split mode, which leads to the error below.
  File "/sftpFile/src/client/fedavg.py", line 112, in train_and_log
    loss_before / num_samples,
ZeroDivisionError: division by zero
So my first question is: how do I run in 'user' mode?
I also have another question.
I have noticed that in 'sample' split mode, the code uses 'test data' and 'train data' from the same client. Does that work? The accuracy I measured is 8-90%, higher than centralized training. Is it overfitting? I also want to know how to evaluate the accuracy: from the chart, or from the test result?
Thank you very much!
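Regarding the ZeroDivisionError above, a minimal sketch of a guard for clients that have no evaluation samples. The helper safe_mean is hypothetical and not part of FL-bench, and returning NaN is only one possible choice:

```python
import math

def safe_mean(total: float, num_samples: int) -> float:
    """Return total / num_samples, or NaN when the client has no samples."""
    return total / num_samples if num_samples > 0 else float("nan")

print(safe_mean(3.0, 2))  # regular client
print(safe_mean(0.0, 0))  # client without test data
```

A guard like this only hides the symptom. Whether clients in 'user' mode are supposed to have test data at all is a separate question.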
Hello, first of all thank you very much for your open-source spirit and contributions. While studying your code, I found that if I use --split user when generating data, then in
`    def finetune(self):
        """
        The fine-tune function. If your method has a different fine-tuning operation, consider overriding this.
        This function will only be activated while in FL test round.
        """
        self.model.train()
        for _ in range(self.args.finetune_epoch):
            for x, y in self.trainloader:
                if len(x) <= 1:
                    continue
                x, y = x.to(self.device), y.to(self.device)
                logit = self.model(x)
                loss = self.criterion(logit, y)
                self.optimizer.zero_grad()
                loss.backward()
                self.optimizer.step()`
testing iterates with for client_id in self.test_clients:, so wouldn't self.trainloader be empty? I am not sure whether I have misunderstood something. If you could find the time to answer my question, I would be very grateful.
Line 104 in 58b7bc3
In this function, the deltas of all clients in self.delta_list are used in cluster aggregation, including those of clients that are not participating in the current round.
I don't think this matches the original paper. Please correct me if my understanding is wrong.
When I use the command python generate_data.py -d medmnistC -a 0.1 -cn 100, it takes a long time to execute and seems to fail because I previously used python generate_data.py -d medmnistC -a 0.5 -cn 100. Do you know how to resolve this issue?
The preprocess.py in data/domain seems to generate only heterogeneous partitions, i.e., each client has only one domain. How can I generate a homogeneous partition for DomainNet, where the domain distribution is the same on every client?
Hi, KarhouTam.
I recently discussed dataset splitting for traditional FL (fedavg, fedprox, feddyn, etc.) with my colleague.
My colleague insisted that, in order to evaluate an FL algorithm, I should evaluate the model on an isolated dataset
(e.g. for MNIST, first split off 6000~ images as a test dataset for global model evaluation, then assign the rest of the images to the clients for test and eval).
As far as I know, the global server can't see the entire dataset because of privacy, right?
I think that's why your dataset creation setting also doesn't assign a test dataset for global model evaluation.
What is your opinion about this?
FedLC's training on CIFAR10/100 tends to break down after a random epoch, and the breakdown epoch seems to differ across random seeds. The problem does not occur when training on TinyImageNet. I would appreciate your help.
COPY failed: forbidden path outside the build context: ../ ()
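Where is step 2 in Monitor executed?
https://github.com/KarhouTam/FL-bench/blob/master/src/client/fedavg.py#L36: I noticed that this transform is applied to the whole dataset. Isn't data augmentation usually applied only to the trainSet? The testSet should not be transformed.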
Hi KarhouTam.
I'd like to apply your FL benchmark setup to my segmentation task,
but I think structuring a segmentation model to fit your setup is a little tricky.
As I understand it, your setup requires a base layer, a classifier layer, and so on.
Is there any advice for setting up a segmentation model?
(base) alami@alami-Latitude-7390:~/Téléchargements/FL-bench-master$ sed -i "26,30d" pyproject.toml && poetry lock --no-update && poetry install
RuntimeError
The Poetry configuration is invalid:
- [readme] ['README.md', 'data/README.md'] is not of type 'string'
at /usr/lib/python3/dist-packages/poetry/core/factory.py:43 in create_poetry
39│ message = ""
40│ for error in check_result["errors"]:
41│ message += " - {}\n".format(error)
42│
→ 43│ raise RuntimeError("The Poetry configuration is invalid:\n" + message)
44│
45│ # Load package
46│ name = local_config["name"]
47│ version = local_config["version"]
Describe it in here rather than open an issue for request.
🕊 NOTE: I will not guarantee to fulfill or even respond to your feature request.
"I want to specify the model as MobileNetV2 using your code, but I keep getting an error, 'RuntimeError: size mismatch (got input: [10], target: [32]).' Do you know what's happening?"