
fedml-ai / fednlp

223 stars · 12 watchers · 45 forks · 12.21 MB

FedNLP: An Industry and Research Integrated Platform for Federated Learning in Natural Language Processing, backed by FedML, Inc. The previous research version was accepted to NAACL 2022.

federated-learning natural-language-processing nlp machine-learning


fednlp's People

Contributors

chaoyanghe · devirule · elliebababa · raymondtseng · sauravpr · yuchenlin


fednlp's Issues

object doesn't exist for text classification script

If I run a text classification model with distilbert using:

DATA_NAME=20news
CUDA_VISIBLE_DEVICES=1 python -m experiments.centralized.transformer_exps.main_tc \
  --dataset ${DATA_NAME} \
  --data_file ~/fednlp_data/data_files/${DATA_NAME}_data.h5 \
  --partition_file ~/fednlp_data/partition_files/${DATA_NAME}_partition.h5 \
  --partition_method niid_label_clients=100.0_alpha=5.0 \
  --model_type distilbert \
  --model_name distilbert-base-uncased \
  --do_lower_case True \
  --train_batch_size 32 \
  --eval_batch_size 8 \
  --max_seq_length 256 \
  --learning_rate 5e-5 \
  --epochs 20 \
  --evaluate_during_training_steps 500 \
  --output_dir /tmp/${DATA_NAME}_fed/ \
  --n_gpu 1

I get the error KeyError: "Unable to open object (object 'niid_label_clients=100.0_alpha=5.0' doesn't exist)", but shouldn't the object exist?
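A quick way to check which partition-method keys the file actually contains is to list the top-level groups of the partition HDF5 file. A minimal sketch using h5py; note that other issues below run with clients=100 rather than clients=100.0, so the mismatch may simply be the key spelling:

import os
import h5py

# List the top-level groups of the partition file; the --partition_method
# value must match one of these keys exactly (including the
# 'clients=100' vs 'clients=100.0' spelling).
path = os.path.expanduser("~/fednlp_data/partition_files/20news_partition.h5")
with h5py.File(path, "r") as f:
    for key in f.keys():
        print(key)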

The client-side learning rate warm-up and scheduler may be an issue for FL.

scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=warmup_steps, num_training_steps=t_total
)

Combining a client-side learning-rate warm-up and scheduler with a distributed optimizer (FedAvg, FedOpt, etc.) looks unreasonable under current optimization theory. We need to confirm its effectiveness by experiment. My suggestion is to delete it if the experiments show it does not improve accuracy much, because ML researchers may think this algorithmic combination is wrong, and we should avoid creating that confusion.

Optimizer suggestion for federated learning experiments

1. Client optimizer (SGD) + server optimizer (Adam).

Pro: good for cross-device FL, since devices do not need to synchronize optimizer states.
Con: accuracy may drop a bit; we are not sure, since nobody has explored this with Transformers. Maybe our paper can make this contribution.

2. Client optimizer (Adam) + server optimizer (Adam).

Pro: good for accuracy.
Con: only works in the cross-silo setting.

We will include a discussion of optimizer performance in our benchmarking paper, which is intended as a field guide for users of the benchmark.
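For reference, option 1 corresponds to the FedOpt pattern in which the server treats the averaged client update as a pseudo-gradient for Adam. A minimal sketch with hypothetical names (global_model, client_deltas); this is not the FedNLP implementation:

import torch

# Server-side Adam over averaged client updates (FedOpt-style).
# `client_deltas` is a list of {param_name: local_weights - global_weights}
# dicts collected from the clients sampled this round.
def server_adam_step(global_model, client_deltas, server_opt):
    server_opt.zero_grad()
    for name, param in global_model.named_parameters():
        avg_delta = torch.stack([d[name] for d in client_deltas]).mean(dim=0)
        param.grad = -avg_delta  # negative: Adam minimizes, the delta points downhill
    server_opt.step()

# Usage: server_opt = torch.optim.Adam(global_model.parameters(), lr=1e-3),
# then call server_adam_step(...) once per communication round.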

Hyperparameters for reproducing the results of the paper?

Hi, thank you for the work.

I am confused regarding the learning rates used in the experiments.
The README.md (experiments/distributed/transformer_exps/README.md) has different server learning rates (0.1, 5e-5) from those in the paper (Section 4.3). I am trying to reproduce some experiments as a baseline, but I reach either higher or lower performance than the reported accuracy.

For the seq2seq task (Gigaword), could you report the server and client learning rates, or, even better, point me to the wandb project?

Thanks

No such file or directory: 'cache_dir/distilbert_distilbert-base-uncased_cached_256_ClassificationModel_20news_niid_label_clients=100_alpha=5.0_-1'

If I run the script
sh run_text_classification.sh FedAvg "niid_label_clients=100_alpha=5.0" 5e-5 0.1 2 0
it returns an error:

File "~/workspace/FedNLP/data_manager/base_data_manager.py", line 198, in load_federated_data_server
  with open(res, "wb") as handle:
FileNotFoundError: [Errno 2] No such file or directory: 'cache_dir/distilbert_distilbert-base-uncased_cached_256_ClassificationModel_20news_niid_label_clients=100_alpha=5.0_-1'
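The open(res, "wb") call fails because the cache_dir directory it writes into does not exist yet. A hedged workaround (an assumption based on the traceback, not a confirmed upstream fix) is to create the directory before launching the script:

import os

# open(..., "wb") cannot create intermediate directories, so make sure the
# cache directory used by the data manager exists up front.
os.makedirs("cache_dir", exist_ok=True)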

Bugs

Dear all,
We tried to integrate your repository into our own project; however, a lot of bugs remain.
For example:

  • problem with pip install mpi4py: we needed to export MPICC=/usr/lib64/openmpi/bin/mpicc (or another path, which you can find with a locate command) (solved)
  • problem with the installation of wandb (creating an account, wandb login, API key) (solved)
  • problem with the entity = automl replacement (it is documented): I wrote a bash script to do it (solved)
  • problem of missing partitions: we needed to change some scripts to remove occurrences (for example:

#all_partition_files = ['20news_partition.h5', 'agnews_partition.h5', 'cornell_movie_dialogue_partition.h5', 'onto_partition.h5', 'ploner_partition.h5', 'squad_1.1_partition.h5', 'sst_2_partition.h5', 'wikiner_partition.h5', 'w_nut_partition.h5']
Changed to
all_partition_files = ['20news_partition.h5', 'onto_partition.h5', 'squad_1.1_partition.h5'])

but we always get a KeyError: "Unable to open object (object 'niid_label_clients=100.0_alpha=5.0' doesn't exist)" (possibly the clients=100 vs clients=100.0 key mismatch noted in the first issue above)

  • etc.

Would it be possible to have a call about this? Or just to get more information?
Thank you

lstm experiment data loader files missing

In the file 'experiments/centralized/bilstm_exps/main_text_classification':

import data_preprocessing.AGNews.data_loader
import data_preprocessing.SST_2.data_loader
import data_preprocessing.SemEval2010Task8.data_loader
import data_preprocessing.Sentiment140.data_loader
import data_preprocessing.news_20.data_loader

These loaders are missing.

Pretraining vs. Fine-tuning

I got training results using the default hyper-parameters currently on our GitHub. Since we use pretrained weights, accuracy is already as high as 79% in the first round and hits its ceiling (around 83%) within a few rounds, so running many rounds of federated training is unnecessary (our current default is 500; I terminated training at 100). Check the results here:

https://wandb.ai/automl/fednlp/runs/3oc7a3jc/logs?workspace=user-chaoyanghe-com

Given that NLP is dominated by Transformer-based pretraining, in FedNLP we only need to do federated fine-tuning, right? Actually, I have a research idea about federated pretraining for the cross-silo setting; let's discuss it more after ICML.

Some suggestions/issues in the "XXXModel" class

I am trying to simplify the code.

Some suggestions:

  1. I suggest keeping each function within one screen (fewer than 100 lines). In some companies this is a hard rule. Even in the NLP domain, libraries like Hugging Face follow it well, and their code is readable to me.

  2. I don't see any special training tricks, so the training loop can also fit within one screen.

  3. When we want to reuse a piece of functionality, it's better to extract it into a function. For example: 1) the early-stopping code is repeated twice in the training loop with nearly identical content; 2) defining the trainable parameters (the beginning of train()) can be shrunk into a function (see the sketch after this list).

  4. Code for many Transformer variants is merged via different branches. I suggest considering only the models used in FL; as a benchmark, two models are enough.
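As an illustration of point 3, here is a minimal sketch of pulling the trainable-parameter definition out of train() into a helper. The names are hypothetical, and the grouping follows the usual Transformer weight-decay convention rather than the exact FedNLP code:

# Group parameters so that biases and LayerNorm weights get no weight decay,
# as is conventional when fine-tuning Transformers.
def get_trainable_params(model, weight_decay=0.01):
    no_decay = ["bias", "LayerNorm.weight"]
    return [
        {"params": [p for n, p in model.named_parameters()
                    if not any(nd in n for nd in no_decay)],
         "weight_decay": weight_decay},
        {"params": [p for n, p in model.named_parameters()
                    if any(nd in n for nd in no_decay)],
         "weight_decay": 0.0},
    ]

# train() can then start with: optimizer = AdamW(get_trainable_params(model), lr=...)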

Some issues:

  1. This class also performs data loading inside the training and evaluation loops, which in FL should be handled outside the trainer class; otherwise performance may degrade and hidden issues may appear. Under the FedML framework, the design pattern is to finish each client's data loading before starting the training loop.

(I may update this once I find more.)

Could you support knowledge distillation-based FL algorithms like FedED, FedMD, or FedDF?

Currently, the platform supports several parameter-averaging-based FL algorithms. Could you support knowledge-distillation-based FL algorithms like FedED, FedMD, or FedDF?

FedED: Federated Learning via Ensemble Distillation for Medical Relation Extraction.
FedMD: Heterogeneous Federated Learning via Model Distillation. NeurIPS Workshop, 2019.
FedDF: Ensemble Distillation for Robust Model Fusion in Federated Learning. NeurIPS, 2020.
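For context, the server step these methods share is distillation rather than parameter averaging. A rough FedDF-style sketch with hypothetical names (client_models, transfer_loader; HF-style models returning .logits, batches as dicts of tensors), not FedNLP code:

import torch
import torch.nn.functional as F

# FedDF-style server step: instead of averaging parameters, the server
# distills the clients' averaged logits on an unlabeled transfer set into
# the global model.
def feddf_server_step(global_model, client_models, transfer_loader,
                      lr=1e-5, temperature=1.0):
    opt = torch.optim.Adam(global_model.parameters(), lr=lr)
    for batch in transfer_loader:
        with torch.no_grad():
            teacher = torch.stack(
                [m(**batch).logits for m in client_models]).mean(dim=0)
        student = global_model(**batch).logits
        loss = F.kl_div(F.log_softmax(student / temperature, dim=-1),
                        F.softmax(teacher / temperature, dim=-1),
                        reduction="batchmean") * temperature ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()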

No dataloader in data_preprocessing

Hi, I notice that experiments/centralized/bilstm_exps/main_text_classification.py imports different data loaders for different tasks. For example:

import data_preprocessing.AGNews.data_loader
import data_preprocessing.SST_2.data_loader
import data_preprocessing.SemEval2010Task8.data_loader
import data_preprocessing.Sentiment140.data_loader
import data_preprocessing.news_20.data_loader

But I can't find these data loaders under data_preprocessing. Was this code removed or lost?

Thanks,

Hanging after last round of training

Hi, thanks for the great work.

When running
sh run_text_classification.sh FedOPT "niid_label_clients=100_alpha=100.0" 1e-3 0.1 1 4
the process does not terminate automatically after the last round of training, regardless of the number of communication rounds.

The log stops after displaying the last eval metric:

18521 2021-12-29,21:14:53.265 - {tc_transformer_trainer.py (180)} - eval_model(): best_accuracy = 0.000000
18521 2021-12-29,21:14:53.266 - {tc_transformer_trainer.py (188)} - eval_model(): {'mcc': 0.0, 'tp': 0, 'tn': 0, 'fp': 0, 'fn': 0, 'acc': 0.0, 'eval_loss': 3.01809245740279}

Commenting out post_complete_message_to_sweep_process(self.args) in ClientManager and ServerManager lets the program exit, so the problem seems to be something with the FIFO. Will commenting out the function cause any problems?

This is possibly related to an issue from FedML.
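For background on why a stray FIFO can hang an otherwise finished process, here is a self-contained illustration (hypothetical path and message, not FedNLP code): opening a named pipe for writing blocks until a reader opens the other end, so a completion message posted to a FIFO nobody reads never returns.

import os

# Opening a FIFO for writing blocks until another process opens it for
# reading, so this script intentionally hangs at open() if no reader exists.
path = "/tmp/demo_fifo"
if not os.path.exists(path):
    os.mkfifo(path)
with open(path, "w") as pipe:  # blocks here until a reader attaches
    pipe.write("training finished\n")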

Problems running the text classification model with distilbert

Our environment: Windows 10.
During installation, we strictly followed the Python version and configuration steps in the README.md file.

  1. When we run the dependency test command:
    python -m model.fed_transformers.test
    it reports:
    Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
  2. When we run the text classification model with distilbert, it reports an error:
    requests.exceptions.ProxyError: HTTPSConnectionPool(host='api.wandb.ai', port=443): Max retries exceeded with url: /graphql (Caused by ProxyError('Cannot connect to proxy.', OSError(0, 'Error')))
    wandb: Network error (ProxyError), entering retry loop.
    The specific error can be found in the attached error_log.txt. We have studied this for a long time but have no clue, which is very confusing. Please give us some advice. Thank you!
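If the proxy itself cannot be fixed, a hedged workaround is wandb's offline mode (a standard wandb feature, though we have not verified it against this exact setup); set it before wandb.init() runs:

import os

# Log metrics locally instead of contacting api.wandb.ai; runs can be
# uploaded later with `wandb sync` once the network issue is resolved.
os.environ["WANDB_MODE"] = "offline"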

QA does not calculate the F1 score; how can I fix it?

INFO:root:epoch = 1, batch_idx = 2164/5520, loss = 0.6054576635360718
INFO:root:epoch = 1, batch_idx = 2165/5520, loss = 0.4317961633205414
INFO:root:epoch = 1, batch_idx = 2166/5520, loss = 1.390831470489502
INFO:root:epoch = 1, batch_idx = 2167/5520, loss = 1.07370924949646
INFO:root:epoch = 1, batch_idx = 2168/5520, loss = 1.1164920330047607
INFO:root:epoch = 1, batch_idx = 2169/5520, loss = 0.6192945241928101
INFO:root:epoch = 1, batch_idx = 2170/5520, loss = 0.7042073607444763
INFO:root:epoch = 1, batch_idx = 2171/5520, loss = 0.702218770980835
INFO:root:epoch = 1, batch_idx = 2172/5520, loss = 0.47233307361602783
INFO:root:epoch = 1, batch_idx = 2173/5520, loss = 0.547944188117981
INFO:root:epoch = 1, batch_idx = 2174/5520, loss = 0.703558087348938
INFO:root:epoch = 1, batch_idx = 2175/5520, loss = 0.793656587600708
INFO:root:epoch = 1, batch_idx = 2176/5520, loss = 0.790333092212677
INFO:root:epoch = 1, batch_idx = 2177/5520, loss = 0.5816390514373779
INFO:root:epoch = 1, batch_idx = 2178/5520, loss = 0.9623005986213684
INFO:root:epoch = 1, batch_idx = 2179/5520, loss = 0.6054102182388306
INFO:root:cached_features_file = cache_dir/cached_dev_bert_256_34726
INFO:examples.question_answering.question_answering_model: Features loaded from cache at cache_dir/cached_dev_bert_256_34726
Running Evaluation: 100%|██████████| 2203/2203 [03:30<00:00, 10.46it/s]
INFO:root:{}
INFO:examples.question_answering.question_answering_model:{'correct': 15510, 'similar': 14006, 'incorrect': 5210, 'eval_loss': -7.354878266291677}
INFO:root:epoch = 1, batch_idx = 2180/5520, loss = 0.5322230458259583
INFO:root:epoch = 1, batch_idx = 2181/5520, loss = 0.5274325609207153
INFO:root:epoch = 1, batch_idx = 2182/5520, loss = 0.5773954391479492
INFO:root:epoch = 1, batch_idx = 2183/5520, loss = 0.8108208775520325
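The evaluation above only logs correct/similar/incorrect counts and an empty metrics dict ({}). If you need F1 yourself, the standard SQuAD-style token-overlap F1 can be computed directly from predicted and gold answer strings; a generic sketch, not FedNLP's evaluation code:

from collections import Counter

# SQuAD-style token-overlap F1 between one predicted answer string and one
# gold answer string (whitespace tokenization; the official script also
# applies article/punctuation normalization, omitted here).
def squad_f1(prediction, ground_truth):
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)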

Error running uniform partition for text classification

Hi.
I am encountering an EOFError when trying to run the uniform partition for text classification.

run_text_classification.sh FedOPT "uniform" 5e-5 0.1 51 4

27440 2022-01-09,13:28:37.575 - {base_data_manager.py (306)} - _load_data_loader_from_cache(): Loading features from cached file cache_dir/distilbert_distilbert-base-uncased_cached_256_ClassificationModel_20news_uniform_75
Traceback (most recent call last):
File "/home/ky/miniconda3/envs/fednlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/ky/miniconda3/envs/fednlp/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ky/Research/NLP/FL/FedNLP/experiments/distributed/transformer_exps/run_tc_exps/fedavg_main_tc.py", line 140, in
train_data_local_dict, test_data_local_dict, num_clients = dm.load_federated_data(process_id=process_id)
File "/home/ky/Research/NLP/FL/FedNLP/data_manager/base_data_manager.py", line 142, in load_federated_data
return self._load_federated_data_local()
File "/home/ky/Research/NLP/FL/FedNLP/data_manager/base_data_manager.py", line 240, in _load_federated_data_local
state, res = self._load_data_loader_from_cache(client_idx)
File "/home/ky/Research/NLP/FL/FedNLP/data_manager/base_data_manager.py", line 309, in _load_data_loader_from_cache
train_examples, train_features, train_dataset, test_examples, test_features, test_dataset = pickle.load(handle)
EOFError: Ran out of input

The non-IID partition method works fine.
Any suggestions on how to fix this?

Thanks.
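A hedged guess rather than a confirmed fix: "EOFError: Ran out of input" from pickle.load() usually means the cached feature file exists but is empty or truncated, for example one left behind by an interrupted run. Deleting the stale uniform-partition cache files (pattern inferred from the log above) forces the data manager to rebuild them:

import glob
import os

# Remove possibly-truncated cached feature files for the uniform partition
# so the data manager regenerates them on the next run.
for f in glob.glob("cache_dir/*_20news_uniform_*"):
    os.remove(f)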

Problems running the text classification model: No module named 'transformers'

Hi,
Our environment: Ubuntu 16.04.
During installation, we strictly followed the Python version and configuration steps in the README.md file. We confirmed that all required dependencies were installed, but the following error occurs as soon as we run 'bash run_simulation.sh'.
1. When we run the dependency test command:
bash run_simulation.sh
the error shown in the screenshot below appears.
(screenshot of the "No module named 'transformers'" error omitted)

Please give us some advice. Thank you!

About large model

I want to know how you maintain the parameters of each large model (such as BERT) during federated learning, for example with the FedAvg algorithm. If you run federated learning locally, you need to keep many sets of model parameters in memory before server aggregation.
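One common way to keep memory bounded in a local FedAvg simulation, sketched below with hypothetical names (train_fn, client_loaders; not FedNLP's implementation), is to train clients sequentially and fold each client's weights into a single running weighted sum, so only one local copy of the large model is alive at a time:

import copy

def fedavg_round(global_model, client_loaders, train_fn):
    total, avg_state = 0, None
    for loader in client_loaders:
        local = copy.deepcopy(global_model)   # the only local copy alive
        train_fn(local, loader)               # local training for one client
        n = len(loader.dataset)
        sd = local.state_dict()
        if avg_state is None:
            avg_state = {k: v.float() * n for k, v in sd.items()}
        else:
            for k in avg_state:
                avg_state[k] += sd[k].float() * n
        total += n
        del local                             # free the local model before the next client
    global_model.load_state_dict({k: v / total for k, v in avg_state.items()})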

KeyError: 'Unable to open object (bad heap free list)'

When I use 20news for classification, I get this error; can anyone help me? I got the dataset from here: https://fednlp.s3-us-west-1.amazonaws.com/partition_files/20news_partition.h5 and https://fednlp.s3-us-west-1.amazonaws.com/data_files/20news_data.h5

Loading data from h5 file.: 0%| | 0/11314 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/root/miniconda3/envs/fednlp/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/root/miniconda3/envs/fednlp/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/FedNLP-master/experiments/centralized/transformer_exps/main_tc.py", line 91, in
train_dl, test_dl = dm.load_centralized_data()
File "/home/FedNLP-master/data_manager/base_data_manager.py", line 112, in load_centralized_data
train_data = self.read_instance_from_h5(data_file, train_index_list)
File "/home/FedNLP-master/data_manager/text_classification_data_manager.py", line 23, in read_instance_from_h5
X.append(data_file["X"][str(idx)][()].decode("utf-8"))
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "/root/miniconda3/envs/fednlp/lib/python3.7/site-packages/h5py/_hl/group.py", line 305, in getitem
oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: 'Unable to open object (bad heap free list)'

How to change the number of clients

Hi.
When I try to change the total number of clients in a distributed cluster, I use "--client_num_in_total" to specify it, but it doesn't work.
Could you tell me how to change the total number of clients? Thank you!

[IMPORTANT] Client Sampling Frozen

Hello authors,

I'm currently running your text classification work on the 20news dataset, using a single Nvidia A6000 with the FedOPT algorithm, 50 clients in total and 2 clients per round.

After the data are loaded, once the training process reaches the client sampling step, it freezes like this:
1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (22)} - mapping_processes_to_gpu_device_from_yaml_file(): gpu_util = {'ChaoyangHe-GPU-RTX2080Tix4': [1, 0]}
1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (31)} - mapping_processes_to_gpu_device_from_yaml_file(): Process 0 running on host: ChaoyangHe-GPU-RTX2080Tix4, gethostname: lacrymosa.ics.uci.edu, local_gpu_id: 0 ...
1133831 2022-03-08,21:49:04.477 - {gpu_mapping.py (32)} - mapping_processes_to_gpu_device_from_yaml_file(): i = 1, worker_number = 1
1133831 2022-03-08,21:49:04.511 - {gpu_mapping.py (37)} - mapping_processes_to_gpu_device_from_yaml_file(): process_id = 0, GPU device = cuda:0
1133831 2022-03-08,21:49:04.511 - {fedavg_main_tc.py (84)} - (): process_id = 0, size = 1, device=cuda:0
1133831 2022-03-08,21:49:04.512 - {fedavg_main_tc.py (85)} - (): torch.cuda.current_device()=0
1133831 2022-03-08,21:49:04.512 - {fedavg_main_tc.py (86)} - (): torch.cuda.device_count()=2
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight']

  • This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Loading index from h5 file.: 100%|██████████| 100/100 [00:00<00:00, 2576.67it/s]
1133831 2022-03-08,21:49:07.753 - {base_data_manager.py (183)} - _load_federated_data_server(): caching test index size 7532test cut off None
Loading data from h5 file.: 100%|██████████| 7532/7532 [00:02<00:00, 3057.08it/s]
100%|██████████| 7532/7532 [00:03<00:00, 2410.62it/s]
1133831 2022-03-08,21:49:13.718 - {text_classification_preprocessor.py (145)} - transform_features(): 7532 features created from 7532 samples.
1133831 2022-03-08,21:49:13.764 - {base_data_manager.py (196)} - _load_federated_data_server(): caching test data size 7532
1133831 2022-03-08,21:49:13.858 - {base_data_manager.py (219)} - _load_federated_data_server(): test_dl_global number = 942
1133831 2022-03-08,21:49:13.861 - {FedOptAggregator.py (132)} - client_sampling(): client_indexes = [26 86]

I don't know why this is happening. Could you help me with this issue?
