asus-aics / libmultilabel

A library for multi-class and multi-label classification

License: MIT License

Python 96.16% Shell 3.84%

multilabel-classification multiclass-classification multi-label-classification text-classificaiton external-code

libmultilabel's Introduction

LibMultiLabel — a Library for Multi-class and Multi-label Classification

LibMultiLabel is a library for binary, multi-class, and multi-label classification. It provides the following functionality:

end-to-end services from raw texts to final evaluation/analysis
support for common neural network architectures and linear classifiers
easy hyper-parameter selection

LibMultiLabel is under active development, so many improvements are still being made. Comments are very welcome.

Environments

Python: 3.8+
CUDA: 11.8, 12.1 (if training neural networks on a GPU)
PyTorch: 2.0.1+

If you have a different version of CUDA, follow the installation instructions for PyTorch LTS on the PyTorch website.

Documentation

See the documentation here: https://www.csie.ntu.edu.tw/~cjlin/libmultilabel

libmultilabel's Issues

Running linear models requires nn dependencies

main.py uses libmultilabel/logging.py which unconditionally imports transformers.
This makes it impossible to run linear models without installing transformers.

Also, stream_handler in main.py is never used:

LibMultiLabel/main.py

Line 195 in c5103b4

stream_handler = add_stream_handler(log_level)

Specifying behaviour when text has tabs

When running on Wiki10-31K from the libsvm datasets, I encountered problems caused by stray \t characters in the text.
Specifically, the first occurrence is at line 322, where the text after the first \t is cut off.
The readme should (currently) state explicitly that \t is forbidden in the text.

On the other hand, the data loading should either handle such situations or emit an error, not run silently and erroneously.
There are two situations:

If there are two \t in a line, there is ambiguity because an optional id column is permitted. Perhaps require the entire file to have the same column count.
If there are three or more \t in a line, every \t after the second should not be treated as a separator; it is either part of the text or an error.

The relevant code is

LibMultiLabel/libmultilabel/data_utils.py

Lines 80 to 82 in 260850c

    data = pd.read_csv(path, sep='\t', names=['label', 'text'],
                       converters={'label': lambda s: s.split(),
                                   'text': tokenize})
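One way to make loading fail loudly instead of truncating text is sketched below. It assumes the layout discussed in this issue (two columns, or three with an optional leading id column) and requires every row to have the same column count. The helper name `read_tsv_strict` is hypothetical and is not part of LibMultiLabel.

```python
def read_tsv_strict(lines):
    """Parse 'label<TAB>text' or 'id<TAB>label<TAB>text' rows.

    Hypothetical helper: every row must have the same number of columns,
    and a mismatch raises instead of silently cutting off the text.
    """
    rows = []
    expected = None  # column count, fixed by the first non-empty row
    for lineno, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if expected is None:
            expected = len(fields)
            if expected not in (2, 3):
                raise ValueError(
                    f"line {lineno}: expected 2 or 3 columns, got {expected}"
                )
        if len(fields) != expected:
            raise ValueError(
                f"line {lineno}: expected {expected} columns, "
                f"got {len(fields)} (stray tab in the text?)"
            )
        *_, label, text = fields  # drop the optional id column if present
        rows.append((label.split(), text))
    return rows
```

This does not resolve the two-tab ambiguity when the very first row contains a stray tab; an explicit option stating whether an id column exists would be needed for that.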

Question regarding adding own vocab and embeddings

Hello developers. If I want to add my own embeddings, such as ELMo or others, I first have to collect the vocabulary for the whole dataset. For example:

if my data looks like this:

import numpy as np

sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'emd'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

I have created the vocab like this:

vocab = list(set([word for list_word in sentences for word in list_word]))

And, for example, my ELMo embedding matrix looks like this:

embed = np.random.uniform(-1,1,[len(vocab), 1024])

Later I write my embeddings to a .txt file such as:

team -0.17901678383350372 1.2405720949172974 ...
language 0.8667483925819397 5.001194953918457 ...

My question is: do I need to add 'UNK', 'PAD', etc. to my vocab, or is my current code fine?
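The steps described in this question can be sketched end to end as follows. The line layout (a word followed by its vector) follows the sample above. Special tokens are deliberately left out, since whether they are required is exactly the open question. The helper name `format_embedding_lines` is hypothetical, and random vectors stand in for real ELMo output.

```python
import random

sentences = [
    ["this", "is", "the", "first", "sentence", "for", "emd"],
    ["this", "is", "the", "second", "sentence"],
    ["yet", "another", "sentence"],
]

# Sorted so that the vocabulary order is reproducible across runs.
vocab = sorted({word for sentence in sentences for word in sentence})


def format_embedding_lines(vocab, vectors):
    """Return one 'word v1 v2 ...' line per vocabulary entry."""
    return [
        word + " " + " ".join(repr(x) for x in vector)
        for word, vector in zip(vocab, vectors)
    ]


dim = 4  # stand-in for the 1024 dimensions mentioned above
rng = random.Random(0)
vectors = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in vocab]
lines = format_embedding_lines(vocab, vectors)

# To save: open("embeddings.txt", "w").write("\n".join(lines))
```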

About different model architecture

This pytorch sample code is very useful and easy to understand.
Is it possible to develop other model architectures such as RNN, RCNN, etc.?
thanks.

How to input raw text with command line interface for linear classifiers?

Hi,

When I checked linear classifiers in the web documentation, I found an inconsistency between the usage of the CLI and the API.

Specifically, the CLI says that the input should be an SVM file (with tf-idf features).

On the other hand, the API allows the input to be a TXT file (raw text).

I think the CLI can also support raw text input, but I am not sure how to do it.

requirements.txt is needed for environment setting

Packages like "python-dotenv, nltk, torchtext, pyyaml" are missing from the Environments section.

Lack of license information

Hi,

After checking the content of the repository, I didn't find an explicit open source license declaration.

It is essential to include an explicit open source license if you want to promote your work as an open source project, as explained below.

"Can I call my program "Open Source" even if I don't use an approved license?
Please don't do that. If you call it "Open Source" without using an approved license, you will confuse people. This is not merely a theoretical concern — we have seen this confusion happen in the past, and it's part of the reason we have a formal license approval process. See also our page on license proliferation for why this is a problem." [1]

[1] https://opensource.org/faq#avoid-unapproved-licenses

It would be great if you can add the license information in SPDX format.

Just my two cents.

CLI cannot save positive labels

Currently main.py only offers to save the k labels with the highest decision values, but doesn't have an option to save all labels with positive decision values.
Confusingly, there are options to compute F1 scores, which are metrics for the use cases where we care about all labels with positive decision values.

Letting "evaluate()" take the output of the model instead of the model itself might be a better and more general choice.

LibMultiLabel/libmultilabel/evaluate.py

Lines 11 to 31 in bbc31f5

    def evaluate(model, dataset_loader, monitor_metrics, label_key='label', silent=False):
        """Evaluate model and add the predictions to MultiLabelMetrics.

        Args:
            model (Model): a high level class used to initialize network, predict examples and load/save model
            dataset_loader (DataLoader): pytorch dataloader (torch.utils.data.DataLoader)
            monitor_metrics (list): metrics to monitor while validating
            label_key (str, optional): the key to label in the dataset. Defaults to 'label'.
        """
        progress_bar = tqdm(dataset_loader, disable=silent)
        eval_metric = MultiLabelMetrics(monitor_metrics=monitor_metrics)

        for batch in progress_bar:
            batch_labels = batch[label_key]
            predict_results = model.predict(batch)
            batch_label_scores = predict_results['scores']

            batch_labels = batch_labels.cpu().detach().numpy()
            batch_label_scores = batch_label_scores.cpu().detach().numpy()
            eval_metric.add_values(batch_labels, batch_label_scores)
        return eval_metric
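The difference between the two selection rules mentioned in this issue can be illustrated with a small sketch. The function names are hypothetical, and treating "positive" as a decision value above a threshold of 0 is an assumption.

```python
def top_k_labels(scores, k):
    """Indices of the k labels with the highest decision values."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def positive_labels(scores, threshold=0.0):
    """Indices of all labels whose decision value exceeds the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


scores = [1.3, -0.2, 0.4, -1.1, 0.1]
print(top_k_labels(scores, k=2))  # [0, 2]
print(positive_labels(scores))    # [0, 2, 4]
```

With a fixed k, the third positive label is dropped, which is why top-k output alone does not match what F1-style metrics measure.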

Clarifying PR 135

In #135, the nn behaviour is

1. ignore include_test_labels if label_file exists
2. ignore include_test_labels if there is no test file
3. default include_test_labels to false

Linear follows 1 and 3, but issues an error on 2, which I think might be an oversight.
The issue with 3 is that the docstring says it defaults to true.

The convention for no option given is currently

invalid path strings for data paths, i.e. if os.path.exists(train_path):
None for label_file, i.e. if label_file:

This is rather confusing. Should the two conventions be unified?

Linear in eval mode does not have multiclass

config.multiclass is only set in the else branch:

LibMultiLabel/linear_trainer.py

Lines 53 to 66 in c5103b4

    if config.eval:
        preprocessor, model = linear.load_pipeline(config.checkpoint_path)
        datasets = preprocessor.load_data(
            config.training_file, config.test_file, config.eval)
    else:
        preprocessor = linear.Preprocessor(data_format=config.data_format)
        datasets = preprocessor.load_data(
            config.training_file,
            config.test_file,
            config.eval,
            config.label_file,
            config.include_test_labels,
            config.remove_no_label_data)
        config.multiclass = is_multiclass_dataset(datasets['train'], label='y')

This attribute is later used in linear_test:

LibMultiLabel/linear_trainer.py

Lines 12 to 17 in c5103b4

    def linear_test(config, model, datasets):
        metrics = linear.get_metrics(
            config.metric_threshold,
            config.monitor_metrics,
            datasets['test']['y'].shape[1],
            multiclass=config.multiclass

causing a missing attribute error.

As an addendum, I think config should never be modified after main.py. This was a huge source of confusion on the nn side previously.
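The failure mode can be reproduced in isolation with a plain namespace object standing in for the config. Everything in this sketch is a stand-in: `load_config` and `is_multiclass` are hypothetical and only mimic the control flow quoted above.

```python
from types import SimpleNamespace


def is_multiclass(labels):
    """Stand-in check: every example has exactly one label."""
    return all(len(y) == 1 for y in labels)


def load_config(eval_mode, train_labels):
    config = SimpleNamespace(eval=eval_mode)
    if config.eval:
        pass  # mirrors the quoted code: nothing sets config.multiclass here
    else:
        config.multiclass = is_multiclass(train_labels)
    return config


train_config = load_config(eval_mode=False, train_labels=[[0], [2], [1]])
eval_config = load_config(eval_mode=True, train_labels=[[0], [2], [1]])

print(train_config.multiclass)             # True
print(hasattr(eval_config, "multiclass"))  # False: reading it raises AttributeError
```

Computing the flag in a function and passing it explicitly, rather than writing it onto config, would avoid both the missing attribute and the late mutation of config.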

how to generate vocab.csv and processed_full.embed in MIMIC-50?

I was trying to run the MIMIC-50 examples and provided the data in the specified format.
Apart from the training data files, there are two extra files: vocab.csv and processed_full.embed.
Neither the docs nor anything else mentions how to generate these files.

Kindly share the script or process for generating those files.
Also, please explain the difference between Macro-F1 and Another Macro-F1.
Thank you!

Can you add more documentation and result images?

I would like step-by-step execution details and images of the results for some sample data.

checkpoint_path & load_checkpoint

FYI,
It seems the current main.py has no load_checkpoint argument.
But in docs/cli/nn.rst, it has --load_checkpoint in example usage.

Thank you so much for your hard work!

read_libsvm_format() does not handle examples with no labels

I think read_libsvm_format() should handle examples with no labels.
Although a normal dataset should not have this problem, it still sometimes happens, for example in the mediamill dataset.
It would also be useful for creating a dataset with only a subset of the labels from a multi-label dataset.
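A minimal sketch of a line parser that tolerates an empty label field is shown below. It assumes the common multi-label libsvm layout of comma-separated labels followed by index:value pairs, and that labels never contain a colon. The function `parse_libsvm_line` is hypothetical and is not LibMultiLabel's read_libsvm_format().

```python
def parse_libsvm_line(line):
    """Parse '<labels> <index>:<value> ...' where <labels> may be missing."""
    tokens = line.split()
    labels = []
    # The first token is the label list unless it already looks like a feature.
    if tokens and ":" not in tokens[0]:
        labels = tokens[0].split(",")
        tokens = tokens[1:]
    features = {}
    for token in tokens:
        index, value = token.split(":", 1)
        features[int(index)] = float(value)
    return labels, features


print(parse_libsvm_line("3,7 1:0.5 4:2"))  # (['3', '7'], {1: 0.5, 4: 2.0})
print(parse_libsvm_line(" 1:0.5"))         # ([], {1: 0.5})
```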

Suggested configuration for training NN models on highly imbalanced datasets

Hi!

I have a binary classification dataset with a highly imbalanced label distribution (pos : neg == 1 : 200).

I tried to apply the BERT code in the Neural Network Quick Start Tutorial directly to this dataset, with the validation metric set to "Macro-F1", but the trained model mostly predicts all negatives.

I am wondering if there are parameters or configurations I could tune in LibMultiLabel for such an imbalanced dataset to improve the model's performance?

For your reference:

I also tried the linear method, where using train_cost_sensitive instead of train_1vsrest noticeably improved this issue: with train_cost_sensitive, the model predicts 4 times more positive samples than with train_1vsrest. However, both methods have Micro-F1 and P@1 close to 0.99 (due to the dominating negative samples) and Macro-F1 around 0.5.

Thanks!

Uninformative model initialization error in parameter search

If network_config is misconfigured for search_params.py, this is the error that you get:

  File "/home/user/miniconda3/envs/ray/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 355, in train
    raise skipped from exception_cause(skipped)
  File "/home/user/miniconda3/envs/ray/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 325, in entrypoint
    return self._trainable_func(
  File "/home/user/miniconda3/envs/ray/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 651, in _trainable_func
    output = fn()
  File "/home/user/miniconda3/envs/ray/lib/python3.8/site-packages/ray/tune/trainable/util.py", line 374, in _inner
    inner(config, checkpoint_dir=None)
  File "/home/user/miniconda3/envs/ray/lib/python3.8/site-packages/ray/tune/trainable/util.py", line 365, in inner
    trainable(config, **fn_kwargs)
  File "search_params.py", line 37, in train_libmultilabel_tune
    trainer = TorchTrainer(
  File "/home/user/workspace/LibMultiLabel/torch_trainer.py", line 69, in __init__
    self._setup_model(
  File "/home/user/workspace/LibMultiLabel/torch_trainer.py", line 154, in _setup_model
    self.model = init_model(
  File "/home/user/workspace/LibMultiLabel/libmultilabel/nn/nn_utils.py", line 95, in init_model
    raise AttributeError(f"Failed to initialize {model_name}.")
AttributeError: Failed to initialize KimCNN.

which gives you no clue what the problem is.
This is because of how the exception is being handled:

LibMultiLabel/libmultilabel/nn/nn_utils.py

Lines 92 to 95 in 415d212

    try:
        network = getattr(networks, model_name)(embed_vecs=embed_vecs, num_classes=len(classes), **dict(network_config))
    except:
        raise AttributeError(f"Failed to initialize {model_name}.")

This does not rethrow the original exception; it raises another one, and that is what is shown in the error.
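One common way to keep the underlying cause visible is explicit exception chaining with `raise ... from`. The sketch below uses stand-in names (`NETWORKS`, a toy `KimCNN`, and a simplified `init_model` signature) that only imitate the structure quoted above; it is not the library's code.

```python
class KimCNN:
    """Stand-in network that requires a num_filter argument."""

    def __init__(self, num_classes, num_filter):
        self.num_classes = num_classes
        self.num_filter = num_filter


NETWORKS = {"KimCNN": KimCNN}  # stand-in for the networks module


def init_model(model_name, num_classes, network_config):
    try:
        return NETWORKS[model_name](num_classes=num_classes, **network_config)
    except Exception as e:
        # 'from e' stores the original error as __cause__, so the traceback
        # shows the real problem in addition to this summary.
        raise RuntimeError(f"Failed to initialize {model_name}: {e}") from e


try:
    # Misconfigured on purpose: 'num_filters' instead of 'num_filter'.
    init_model("KimCNN", num_classes=3, network_config={"num_filters": 128})
except RuntimeError as err:
    print(err)
    print(type(err.__cause__).__name__)  # TypeError
```

Catching `Exception` rather than using a bare `except:` also avoids swallowing KeyboardInterrupt and SystemExit.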

Some minor issues

Why does train_step() check whether isinstance(outputs, dict) is True, while predict() does not?

LibMultiLabel/libmultilabel/model.py

Line 177 in bbc31f5

outputs = self.network(inputs['text'])

The output size should be "(batch_size, num_filter, length - kernel_size + 1)".

LibMultiLabel/libmultilabel/networks/kim_cnn.py

Line 35 in 19728f8

h_sub = conv(h) # (batch_size, num_filter, length)

It might be weird to set the activation function for some customized model inheriting from this basic class here.

LibMultiLabel/libmultilabel/networks/base.py

Line 19 in 19728f8

self.activation = getattr(F, config.activation)
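For the second point above, the expected length follows from the standard 1-D convolution size formula. This is a minimal sketch assuming no padding, stride 1 and dilation 1 by default; the actual settings in kim_cnn.py are not shown in this issue, and the function name is hypothetical.

```python
def conv1d_output_length(length, kernel_size, padding=0, stride=1, dilation=1):
    """Output length of a 1-D convolution (same formula as torch.nn.Conv1d)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# With the defaults this reduces to length - kernel_size + 1.
print(conv1d_output_length(length=100, kernel_size=4))  # 97
```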

#85 regression, cannot finish training

After 66d8b8f, running python main.py --config example_config/rcv1/kim_cnn.yml fails with

Traceback (most recent call last):
  File "main.py", line 175, in <module>
    main()
  File "main.py", line 167, in main
    trainer.train()
  File "/home/julien/workspace/LibMultiLabel/torch_trainer.py", line 135, in train
    logging.info(f'Finished training. Load best model from {self.checkpoint_callback.best_model_path}')
AttributeError: 'TorchTrainer' object has no attribute 'checkpoint_callback'

This is because TorchTrainer._setup_trainer, which previously assigned self.checkpoint_callback, was refactored into self.trainer = init_trainer(...), which no longer sets that attribute.

Previously it was

LibMultiLabel/torch_trainer.py

Lines 102 to 106 in cf19391

    def _setup_trainer(self):
        """Setup torch trainer and callbacks."""
        self.checkpoint_callback = ModelCheckpoint(
            dirpath=self.checkpoint_dir, filename='best_model', save_last=True,
            save_top_k=1, monitor=self.config.val_metric, mode='max')

Now it is

LibMultiLabel/torch_trainer.py

Lines 51 to 54 in 66d8b8f

    self.trainer = init_trainer(checkpoint_dir=config.checkpoint_path,
                                epochs=config.epochs,
                                patience=config.patience,
                                val_metric=config.val_metric)

LibMultiLabel/libmultilabel/nn/nn_utils.py

Lines 101 to 125 in 66d8b8f

    def init_trainer(checkpoint_dir,
                     epochs=10000,
                     patience=5,
                     mode='max',
                     val_metric='P@1',
                     silent=False,
                     use_cpu=False):
        """Initialize a torch lightning trainer.

        Args:
            checkpoint_dir (str): Directory for saving models and log.
            epochs (int): Number of epochs to train. Defaults to 10000.
            patience (int): Number of epochs to wait for improvement before early stopping. Defaults to 5.
            mode (str): One of [min, max]. Decides whether the val_metric is minimizing or maximizing.
            val_metric (str): The metric to monitor for early stopping. Defaults to 'P@1'.
            silent (bool): Enable silent mode. Defaults to False.
            use_cpu (bool): Disable CUDA. Defaults to False.

        Returns:
            pl.Trainer: A torch lightning trainer.
        """

        checkpoint_callback = ModelCheckpoint(
            dirpath=checkpoint_dir, filename='best_model', save_last=True,
            save_top_k=1, monitor=val_metric, mode=mode)

