
fedml-ai / fednlp

223 stars · 12 watchers · 45 forks · 12.21 MB

FedNLP: An Industry and Research Integrated Platform for Federated Learning in Natural Language Processing, backed by FedML, Inc. The previous research version was accepted to NAACL 2022.

federated-learning natural-language-processing nlp machine-learning


fednlp's People

Contributors

chaoyanghe · devirule · elliebababa · raymondtseng · sauravpr · yuchenlin


fednlp's Issues

object doesn't exist for text classification script

If I run a text classification model with distilbert using:

DATA_NAME=20news
CUDA_VISIBLE_DEVICES=1 python -m experiments.centralized.transformer_exps.main_tc \
  --dataset ${DATA_NAME} \
  --data_file ~/fednlp_data/data_files/${DATA_NAME}_data.h5 \
  --partition_file ~/fednlp_data/partition_files/${DATA_NAME}_partition.h5 \
  --partition_method niid_label_clients=100.0_alpha=5.0 \
  --model_type distilbert \
  --model_name distilbert-base-uncased \
  --do_lower_case True \
  --train_batch_size 32 \
  --eval_batch_size 8 \
  --max_seq_length 256 \
  --learning_rate 5e-5 \
  --epochs 20 \
  --evaluate_during_training_steps 500 \
  --output_dir /tmp/${DATA_NAME}_fed/ \
  --n_gpu 1

I get the error KeyError: "Unable to open object (object 'niid_label_clients=100.0_alpha=5.0' doesn't exist)", but shouldn't the object exist?
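A quick way to check which partition-method keys the file actually contains is to list the top-level groups of the partition HDF5 file. A minimal sketch using h5py; note that other issues below run with clients=100 rather than clients=100.0, so the mismatch may simply be the key spelling:

import os
import h5py

# List the top-level groups of the partition file; the --partition_method
# value must match one of these keys exactly (including the
# 'clients=100' vs 'clients=100.0' spelling).
path = os.path.expanduser("~/fednlp_data/partition_files/20news_partition.h5")
with h5py.File(path, "r") as f:
    for key in f.keys():
        print(key)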

The client-side learning rate warm-up and scheduler may be an issue for FL.

scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total
)

Combining a client-side learning-rate warm-up and scheduler with a distributed optimizer (FedAvg, FedOpt, etc.) looks unreasonable under current optimization theory. We need to confirm its effectiveness by experiment. My suggestion is to delete it if the experiments show it does not improve accuracy much, because ML researchers may think this algorithmic combination is wrong, and we should avoid creating that confusion.

Optimizer suggestion for federated learning experiments

1. Client optimizer (SGD) + server optimizer (Adam).

Pro: good for cross-device FL, since devices do not need to synchronize optimizer states.
Con: accuracy may drop a bit; we are not sure, since nobody has explored this with Transformers. Maybe our paper can make this contribution.

2. Client optimizer (Adam) + server optimizer (Adam).

Pro: good for accuracy.
Con: only works in the cross-silo setting.

We will include a discussion of optimizer performance in our benchmarking paper, which is intended as a field guide for users of the benchmark.
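For reference, option 1 corresponds to the FedOpt pattern in which the server treats the averaged client update as a pseudo-gradient for Adam. A minimal sketch with hypothetical names (global_model, client_deltas); this is not the FedNLP implementation:

import torch

# Server-side Adam over averaged client updates (FedOpt-style).
# `client_deltas` is a list of {param_name: local_weights - global_weights}
# dicts collected from the clients sampled this round.
def server_adam_step(global_model, client_deltas, server_opt):
    server_opt.zero_grad()
    for name, param in global_model.named_parameters():
        avg_delta = torch.stack([d[name] for d in client_deltas]).mean(dim=0)
        param.grad = -avg_delta  # negative: Adam minimizes, the delta points downhill
    server_opt.step()

# Usage: server_opt = torch.optim.Adam(global_model.parameters(), lr=1e-3),
# then call server_adam_step(...) once per communication round.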

Hyperparameters for reproducing the results of the paper?

Hi, thank you for the work.

I am confused regarding the learning rates used in the experiments.
The README.md (experiments/distributed/transformer_exps/README.md) has different server learning rates (0.1, 5e-5) from those in the paper (Section 4.3). I am trying to reproduce some experiments as a baseline, but I reach either higher or lower performance than the reported accuracy.

For the seq2seq task (Gigaword), could you report the server and client learning rates, or, even better, point me to the wandb project?

Thanks

No such file or directory: 'cache_dir/distilbert_distilbert-base-uncased_cached_256_ClassificationModel_20news_niid_label_clients=100_alpha=5.0_-1'

If I run the script
sh run_text_classification.sh FedAvg "niid_label_clients=100_alpha=5.0" 5e-5 0.1 2 0
it returns an error:

File "~/workspace/FedNLP/data_manager/base_data_manager.py", line 198, in load_federated_data_server
  with open(res, "wb") as handle:
FileNotFoundError: [Errno 2] No such file or directory: 'cache_dir/distilbert_distilbert-base-uncased_cached_256_ClassificationModel_20news_niid_label_clients=100_alpha=5.0_-1'
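The open(res, "wb") call fails because the cache_dir directory it writes into does not exist yet. A hedged workaround (an assumption based on the traceback, not a confirmed upstream fix) is to create the directory before launching the script:

import os

# open(..., "wb") cannot create intermediate directories, so make sure the
# cache directory used by the data manager exists up front.
os.makedirs("cache_dir", exist_ok=True)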

Bugs

Dear all,
We tried to integrate your repository into our own project; however, a lot of bugs remain.
For example:

  • problem with pip install mpi4py: we needed to export MPICC=/usr/lib64/openmpi/bin/mpicc (or another path, which you can find with a locate command) (solved)
  • problem with the installation of wandb (creating an account, wandb login, API key) (solved)
  • problem with the entity = automl replacement (it is documented): I wrote a bash script to do it (solved)
  • problem of missing partitions: we needed to change some scripts to remove occurrences (for example:

#all_partition_files = ['20news_partition.h5', 'agnews_partition.h5', 'cornell_movie_dialogue_partition.h5', 'onto_partition.h5', 'ploner_partition.h5', 'squad_1.1_partition.h5', 'sst_2_partition.h5', 'wikiner_partition.h5', 'w_nut_partition.h5']
Changed to
all_partition_files = ['20news_partition.h5', 'onto_partition.h5', 'squad_1.1_partition.h5'])

but we always get a KeyError: "Unable to open object (object 'niid_label_clients=100.0_alpha=5.0' doesn't exist)" (possibly the clients=100 vs clients=100.0 key mismatch noted in the first issue above)

  • etc.

Would it be possible to have a call about this? Or just to get more information?
Thank you

lstm experiment data loader files missing

In the file 'experiments/centralized/bilstm_exps/main_text_classification':

import data_preprocessing.AGNews.data_loader
import data_preprocessing.SST_2.data_loader
import data_preprocessing.SemEval2010Task8.data_loader
import data_preprocessing.Sentiment140.data_loader
import data_preprocessing.news_20.data_loader

These loaders are missing.

Pretraining vs. Fine-tuning

I got training results using the default hyper-parameters currently on our GitHub. Since we use pretrained weights, accuracy is already as high as 79% in the first round and hits its ceiling (around 83%) within a few rounds, so running many rounds of federated training is unnecessary (our current default is 500; I terminated training at 100). Check the results here:

https://wandb.ai/automl/fednlp/runs/3oc7a3jc/logs?workspace=user-chaoyanghe-com

Given that NLP is dominated by Transformer-based pretraining, in FedNLP we only need to do federated fine-tuning, right? Actually, I have a research idea about federated pretraining for the cross-silo setting; let's discuss it more after ICML.

Some suggestions/issues in the "XXXModel" class

I am trying to simplify the code.

Some suggestions:

  1. I suggest keeping each function within one screen (fewer than 100 lines). In some companies this is a hard rule. Even in the NLP domain, libraries like Hugging Face follow it well, and their code is readable to me.

  2. I don't see any special training tricks, so the training loop can also fit within one screen.

  3. When we want to reuse a piece of functionality, it's better to extract it into a function. For example: 1) the early-stopping code is repeated twice in the training loop with nearly identical content; 2) defining the trainable parameters (the beginning of train()) can be shrunk into a function (see the sketch after this list).

  4. Code for many Transformer variants is merged via different branches. I suggest considering only the models used in FL; as a benchmark, two models are enough.
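As an illustration of point 3, here is a minimal sketch of pulling the trainable-parameter definition out of train() into a helper. The names are hypothetical, and the grouping follows the usual Transformer weight-decay convention rather than the exact FedNLP code:

# Group parameters so that biases and LayerNorm weights get no weight decay,
# as is conventional when fine-tuning Transformers.
def get_trainable_params(model, weight_decay=0.01):
    no_decay = ["bias", "LayerNorm.weight"]
    return [
        {"params": [p for n, p in model.named_parameters()
                    if not any(nd in n for nd in no_decay)],
         "weight_decay": weight_decay},
        {"params": [p for n, p in model.named_parameters()
                    if any(nd in n for nd in no_decay)],
         "weight_decay": 0.0},
    ]

# train() can then start with: optimizer = AdamW(get_trainable_params(model), lr=...)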

Some issues:

  1. This class also performs data loading inside the training and evaluation loops, which in FL should be handled outside the trainer class; otherwise performance may degrade and hidden issues may appear. Under the FedML framework, the design pattern is to finish each client's data loading before starting the training loop.

(I may update this once I find more.)

Could you support knowledge distillation-based FL algorithms like FedED, FedMD, or FedDF?

Currently, the platform supports several parameter-averaging-based FL algorithms. Could you support knowledge-distillation-based FL algorithms like FedED, FedMD, or FedDF?

FedED: Federated Learning via Ensemble Distillation for Medical Relation Extraction.
FedMD: Heterogeneous Federated Learning via Model Distillation. NeurIPS Workshop, 2019.
FedDF: Ensemble Distillation for Robust Model Fusion in Federated Learning. NeurIPS, 2020.
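For context, the server step these methods share is distillation rather than parameter averaging. A rough FedDF-style sketch with hypothetical names (client_models, transfer_loader; HF-style models returning .logits, batches as dicts of tensors), not FedNLP code:

import torch
import torch.nn.functional as F

# FedDF-style server step: instead of averaging parameters, the server
# distills the clients' averaged logits on an unlabeled transfer set into
# the global model.
def feddf_server_step(global_model, client_models, transfer_loader,
                      lr=1e-5, temperature=1.0):
    opt = torch.optim.Adam(global_model.parameters(), lr=lr)
    for batch in transfer_loader:
        with torch.no_grad():
            teacher = torch.stack(
                [m(**batch).logits for m in client_models]).mean(dim=0)
        student = global_model(**batch).logits
        loss = F.kl_div(F.log_softmax(student / temperature, dim=-1),
                        F.softmax(teacher / temperature, dim=-1),
                        reduction="batchmean") * temperature ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()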

No dataloader in data_preprocessing

Hi, I notice that experiments/centralized/bilstm_exps/main_text_classification.py imports different data loaders for different tasks. For example:

import data_preprocessing.AGNews.data_loader
import data_preprocessing.SST_2.data_loader
import data_preprocessing.SemEval2010Task8.data_loader
import data_preprocessing.Sentiment140.data_loader
import data_preprocessing.news_20.data_loader

But I can't find these data loaders under data_preprocessing. Was this code removed or lost?

Thanks,

Hanging after last round of training

Hi, thanks for the great work.

When running
sh run_text_classification.sh FedOPT "niid_label_clients=100_alpha=100.0" 1e-3 0.1 1 4
the process does not terminate automatically after the last round of training, regardless of the number of communication rounds.

The log stops after displaying the last eval metric:

18521 2021-12-29,21:14:53.265 - {tc_transformer_trainer.py (180)} - eval_model(): best_accuracy = 0.000000
18521 2021-12-29,21:14:53.266 - {tc_transformer_trainer.py (188)} - eval_model(): {'mcc': 0.0, 'tp': 0, 'tn': 0, 'fp': 0, 'fn': 0, 'acc': 0.0, 'eval_loss': 3.01809245740279}

Commenting out post_complete_message_to_sweep_process(self.args) in ClientManager and ServerManager lets the program exit, so the problem seems to be something with the FIFO. Will commenting out the function cause any problems?

This is possibly related to an issue from FedML.
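For background on why a stray FIFO can hang an otherwise finished process, here is a self-contained illustration (hypothetical path and message, not FedNLP code): opening a named pipe for writing blocks until a reader opens the other end, so a completion message posted to a FIFO nobody reads never returns.

import os

# Opening a FIFO for writing blocks until another process opens it for
# reading, so this script intentionally hangs at open() if no reader exists.
path = "/tmp/demo_fifo"
if not os.path.exists(path):
    os.mkfifo(path)
with open(path, "w") as pipe:  # blocks here until a reader attaches
    pipe.write("training finished\n")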

Problems running the text classification model with distilbert

Our environment: Windows 10.
During installation, we strictly followed the Python version and configuration steps in the README.md file.

  1. When we run the dependency test command:
    python -m model.fed_transformers.test
    it reports:
    Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
  2. When we run the text classification model with distilbert, it reports an error:
    requests.exceptions.ProxyError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by ProxyError('Cannot connect to proxy.', OSError(0, 'Error')))
    wandb: Network error (ProxyError), entering retry loop.
    The specific error can be found in the attached error_log.txt. We have studied this for a long time but have no clue, which is very confusing. Please give us some advice. Thank you!
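If the proxy itself cannot be fixed, a hedged workaround is wandb's offline mode (a standard wandb feature, though we have not verified it against this exact setup); set it before wandb.init() runs:

import os

# Log metrics locally instead of contacting api.wandb.ai; runs can be
# uploaded later with `wandb sync` once the network issue is resolved.
os.environ["WANDB_MODE"] = "offline"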

QA does not calculate the F1 score; how can I fix it?

INFO:root:epoch = 1, batch_idx = 2164/5520, loss = 0.6054576635360718
INFO:root:epoch = 1, batch_idx = 2165/5520, loss = 0.4317961633205414
INFO:root:epoch = 1, batch_idx = 2166/5520, loss = 1.390831470489502
INFO:root:epoch = 1, batch_idx = 2167/5520, loss = 1.07370924949646
INFO:root:epoch = 1, batch_idx = 2168/5520, loss = 1.1164920330047607
INFO:root:epoch = 1, batch_idx = 2169/5520, loss = 0.6192945241928101
INFO:root:epoch = 1, batch_idx = 2170/5520, loss = 0.7042073607444763
INFO:root:epoch = 1, batch_idx = 2171/5520, loss = 0.702218770980835
INFO:root:epoch = 1, batch_idx = 2172/5520, loss = 0.47233307361602783
INFO:root:epoch = 1, batch_idx = 2173/5520, loss = 0.547944188117981
INFO:root:epoch = 1, batch_idx = 2174/5520, loss = 0.703558087348938
INFO:root:epoch = 1, batch_idx = 2175/5520, loss = 0.793656587600708
INFO:root:epoch = 1, batch_idx = 2176/5520, loss = 0.790333092212677
INFO:root:epoch = 1, batch_idx = 2177/5520, loss = 0.5816390514373779
INFO:root:epoch = 1, batch_idx = 2178/5520, loss = 0.9623005986213684
INFO:root:epoch = 1, batch_idx = 2179/5520, loss = 0.6054102182388306
INFO:root:cached_features_file = cache_dir/cached_dev_bert_256_34726
INFO:examples.question_answering.question_answering_model: Features loaded from cache at cache_dir/cached_dev_bert_256_34726
Running Evaluation: 100%|██████████| 2203/2203 [03:30<00:00, 10.46it/s]
INFO:root:{}
INFO:examples.question_answering.question_answering_model:{'correct': 15510, 'similar': 14006, 'incorrect': 5210, 'eval_loss': -7.354878266291677}
INFO:root:epoch = 1, batch_idx = 2180/5520, loss = 0.5322230458259583
INFO:root:epoch = 1, batch_idx = 2181/5520, loss = 0.5274325609207153
INFO:root:epoch = 1, batch_idx = 2182/5520, loss = 0.5773954391479492
INFO:root:epoch = 1, batch_idx = 2183/5520, loss = 0.8108208775520325
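The evaluation above only logs correct/similar/incorrect counts and an empty metrics dict ({}). If you need F1 yourself, the standard SQuAD-style token-overlap F1 can be computed directly from predicted and gold answer strings; a generic sketch, not FedNLP's evaluation code:

from collections import Counter

# SQuAD-style token-overlap F1 between one predicted answer string and one
# gold answer string (whitespace tokenization; the official script also
# applies article/punctuation normalization, omitted here).
def squad_f1(prediction, ground_truth):
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)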

Error running uniform partition for text classification

Hi.
I am encountering an EOFError when trying to run the uniform partition for text classification.

run_text_classification.sh FedOPT "uniform" 5e-5 0.1 51 4

27440 2022-01-09,13:28:37.575 - {base_data_manager.py (306)} - _load_data_loader_from_cache(): Loading features from cached file cache_dir/distilbert_distilbert-base-uncased_cached_256_ClassificationModel_20news_uniform_75
Traceback (most recent call last):
File "/home/ky/miniconda3/envs/fednlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/ky/miniconda3/envs/fednlp/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ky/Research/NLP/FL/FedNLP/experiments/distributed/transformer_exps/run_tc_exps/fedavg_main_tc.py", line 140, in
train_data_local_dict, test_data_local_dict, num_clients = dm.load_federated_data(process_id=process_id)
File "/home/ky/Research/NLP/FL/FedNLP/data_manager/base_data_manager.py", line 142, in load_federated_data
return self._load_federated_data_local()
File "/home/ky/Research/NLP/FL/FedNLP/data_manager/base_data_manager.py", line 240, in _load_federated_data_local
state, res = self._load_data_loader_from_cache(client_idx)
File "/home/ky/Research/NLP/FL/FedNLP/data_manager/base_data_manager.py", line 309, in _load_data_loader_from_cache
train_examples, train_features, train_dataset, test_examples, test_features, test_dataset = pickle.load(handle)
EOFError: Ran out of input

The non-IID partition method works fine.
Any suggestions on how to fix this?

Thanks.
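A hedged guess rather than a confirmed fix: "EOFError: Ran out of input" from pickle.load() usually means the cached feature file exists but is empty or truncated, for example one left behind by an interrupted run. Deleting the stale uniform-partition cache files (pattern inferred from the log above) forces the data manager to rebuild them:

import glob
import os

# Remove possibly-truncated cached feature files for the uniform partition
# so the data manager regenerates them on the next run.
for f in glob.glob("cache_dir/*_20news_uniform_*"):
    os.remove(f)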

Problems running the text classification model: No module named 'transformers'

Hi,
Our environment: Ubuntu 16.04.
During installation, we strictly followed the Python version and configuration steps in the README.md file. We confirmed that all required dependencies were installed, but the following error occurs as soon as we run 'bash run_simulation.sh'.
1. When we run the dependency test command:
bash run_simulation.sh
the error shown in the screenshot below appears.
(screenshot of the "No module named 'transformers'" error omitted)

Please give us some advice. Thank you!

About large model

I want to know how you maintain the parameters of each large model (such as BERT) during federated learning, for example with the FedAvg algorithm. If you run federated learning locally, you need to keep many sets of model parameters in memory before server aggregation.
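One common way to keep memory bounded in a local FedAvg simulation, sketched below with hypothetical names (train_fn, client_loaders; not FedNLP's implementation), is to train clients sequentially and fold each client's weights into a single running weighted sum, so only one local copy of the large model is alive at a time:

import copy

def fedavg_round(global_model, client_loaders, train_fn):
    total, avg_state = 0, None
    for loader in client_loaders:
        local = copy.deepcopy(global_model)   # the only local copy alive
        train_fn(local, loader)               # local training for one client
        n = len(loader.dataset)
        sd = local.state_dict()
        if avg_state is None:
            avg_state = {k: v.float() * n for k, v in sd.items()}
        else:
            for k in avg_state:
                avg_state[k] += sd[k].float() * n
        total += n
        del local                             # free the local model before the next client
    global_model.load_state_dict({k: v / total for k, v in avg_state.items()})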

KeyError: 'Unable to open object (bad heap free list)'

When I use 20news for classification, I get this error; can anyone help me? I got the dataset from here: https://fednlp.s3-us-west-1.amazonaws.com/partition_files/20news_partition.h5 and https://fednlp.s3-us-west-1.amazonaws.com/data_files/20news_data.h5

Loading data from h5 file.: 0%| | 0/11314 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/root/miniconda3/envs/fednlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/root/miniconda3/envs/fednlp/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/FedNLP-master/experiments/centralized/transformer_exps/main_tc.py", line 91, in
train_dl, test_dl = dm.load_centralized_data()
File "/home/FedNLP-master/data_manager/base_data_manager.py", line 112, in load_centralized_data
train_data = self.read_instance_from_h5(data_file, train_index_list)
File "/home/FedNLP-master/data_manager/text_classification_data_manager.py", line 23, in read_instance_from_h5
X.append(data_file["X"][str(idx)][()].decode("utf-8"))
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/root/miniconda3/envs/fednlp/lib/python3.7/site-packages/h5py/_hl/group.py", line 305, in getitem
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: 'Unable to open object (bad heap free list)'

How to change the number of clients

Hi.
When I try to change the total number of clients in a distributed cluster, I use "--client_num_in_total" to specify it, but it doesn't work.
Could you tell me how to change the total number of clients? Thank you!

[IMPORTANT] Client Sampling Frozen

Hello authors,

I'm currently running your text classification work on the 20news dataset, using a single Nvidia A6000 with the FedOPT algorithm, 50 clients in total and 2 clients per round.

After the data are loaded, once the training process reaches the client sampling step, it freezes like this:
1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (22)} - mapping_processes_to_gpu_device_from_yaml_file(): gpu_util = {'ChaoyangHe-GPU-RTX2080Tix4': [1, 0]}
1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (31)} - mapping_processes_to_gpu_device_from_yaml_file(): Process 0 running on host: ChaoyangHe-GPU-RTX2080Tix4, gethostname: lacrymosa.ics.uci.edu, local_gpu_id: 0 ...
1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (32)} - mapping_processes_to_gpu_device_from_yaml_file(): i = 1, worker_number = 1
1133831 2022-03-08,21:49:04.511 - {gpu_mapping.py (37)} - mapping_processes_to_gpu_device_from_yaml_file(): process_id = 0, GPU device = cuda:0
1133831 2022-03-08,21:49:04.511 - {fedavg_main_tc.py (84)} - (): process_id = 0, size = 1, device=cuda:0
1133831 2022-03-08,21:49:04.512 - {fedavg_main_tc.py (85)} - (): torch.cuda.current_device()=0
1133831 2022-03-08,21:49:04.512 - {fedavg_main_tc.py (86)} - (): torch.cuda.device_count()=2
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']

  • This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading index from h5 file.: 100%|██████████| 100/100 [00:00<00:00, 2576.67it/s]
1133831 2022-03-08,21:49:07.753 - {base_data_manager.py (183)} - _load_federated_data_server(): caching test index size 7532test cut off None
Loading data from h5 file.: 100%|██████████| 7532/7532 [00:02<00:00, 3057.08it/s]
100%|██████████| 7532/7532 [00:03<00:00, 2410.62it/s]
1133831 2022-03-08,21:49:13.718 - {text_classification_preprocessor.py (145)} - transform_features(): 7532 features created from 7532 samples.
1133831 2022-03-08,21:49:13.764 - {base_data_manager.py (196)} - _load_federated_data_server(): caching test data size 7532
1133831 2022-03-08,21:49:13.858 - {base_data_manager.py (219)} - _load_federated_data_server(): test_dl_global number = 942
1133831 2022-03-08,21:49:13.861 - {FedOptAggregator.py (132)} - client_sampling(): client_indexes = [26 86]

I don't know why this is happening. Could you help me with this issue?
