asus-aics / libmultilabel

A library for multi-class and multi-label classification

License: MIT License

Python 96.16% Shell 3.84%
multilabel-classification multiclass-classification multi-label-classification text-classificaiton external-code

libmultilabel's Introduction

LibMultiLabel — a Library for Multi-class and Multi-label Classification

LibMultiLabel is a library for binary, multi-class, and multi-label classification. It offers:

  • end-to-end services from raw texts to final evaluation/analysis
  • support for common neural network architectures and linear classifiers
  • easy hyper-parameter selection

The library is under active development, and many improvements are still being made. Comments are very welcome.

Environments

  • Python: 3.8+
  • CUDA: 11.8, 12.1 (if training neural networks on a GPU)
  • PyTorch: 2.0.1+

If you have a different version of CUDA, follow the installation instructions for PyTorch LTS on the PyTorch website.

Documentation

See the documentation here: https://www.csie.ntu.edu.tw/~cjlin/libmultilabel

libmultilabel's People

Contributors

aics-review-bot, chengyehli, chihming, cjlin1, donglihe-hub, eleven1liu, ericliu8168, gordon119, henryyang42, jameslyc88, sammer1107, sinacam, thomas0125, tic66777, tonmoregulus, will945945945

libmultilabel's Issues

Specifying behaviour when text has tabs

When running on Wiki10-31K from libsvm datasets, I encountered problems with having stray \t in the text.
Specifically, the first occurrence is at line 322, where the text after the first \t is cut off.
The readme should (currently) be explicit that \t is forbidden in the text.

On the other hand, the data loading should either handle such situations or emit an error, not run silently and erroneously.
There are two situations:

  • If there are two \t in a line, the optional id column makes the line ambiguous. Perhaps require the entire file to have the same column count.
  • If there are three or more \t in a line, every \t after the second should not be treated as a separator; it is either part of the text or an error.

The relevant code is

data = pd.read_csv(path, sep='\t', names=['label', 'text'],
                   converters={'label': lambda s: s.split(),
                               'text': tokenize})
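One way to make loading fail loudly instead of silently truncating text is to split each line by hand and check the column count. The sketch below is not LibMultiLabel's loader: `read_tsv_strict`, its `num_columns` parameter, and the choice to keep any extra tabs as part of the text are assumptions that follow the two proposals above.

```python
def read_tsv_strict(lines, num_columns=2):
    """Parse 'label<TAB>text' (2 columns) or 'id<TAB>label<TAB>text' (3 columns).

    Hypothetical helper: the caller states the column count for the whole
    file, so a line with too few tabs is an error and any tabs beyond the
    expected separators stay inside the text.
    """
    if num_columns not in (2, 3):
        raise ValueError("num_columns must be 2 (label, text) or 3 (id, label, text)")
    rows = []
    for line_number, line in enumerate(lines, start=1):
        # maxsplit keeps every tab after the last expected separator in the text
        fields = line.rstrip("\n").split("\t", num_columns - 1)
        if len(fields) != num_columns:
            raise ValueError(
                f"line {line_number}: expected {num_columns} tab-separated "
                f"columns, found {len(fields)}"
            )
        *_, labels, text = fields  # ignore the optional id column
        rows.append({"label": labels.split(), "text": text})
    return rows
```

Passing an open file object as `lines` works because files iterate line by line; tokenization would then be applied to `text` as a separate step.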

Question regarding adding own vocab and embeddings

Hello developers, I want to add my own embeddings, such as ELMo. As I understand it, I first have to collect the vocabulary for the whole dataset.

For example, if my data looks like this:

import numpy as np

sentences = [['this', 'is', 'the', 'first', 'sentence', 'for', 'emd'],
             ['this', 'is', 'the', 'second', 'sentence'],
             ['yet', 'another', 'sentence'],
             ['one', 'more', 'sentence'],
             ['and', 'the', 'final', 'sentence']]

I have created the vocab like this:

vocab = list(set([word for list_word in sentences for word in list_word]))

And, for example, my ELMo embedding matrix looks like this:

embed = np.random.uniform(-1,1,[len(vocab), 1024])

I then write the embeddings to a .txt file, for example:

team -0.17901678383350372 1.2405720949172974 ...
language 0.8667483925819397 5.001194953918457 ...

My question is: do I need to add 'UNK', 'PAD', etc. to my vocab, or is my current code fine?
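Whether LibMultiLabel expects special tokens in a user-supplied embedding file is not settled in this thread. As a minimal sketch of the workflow described above, the following builds the lines of a `word v1 v2 ...` text file. The function name `embedding_lines` and the `specials` argument (shown with `<pad>` and `<unk>`, given zero vectors) are illustrative assumptions, not names the library is known to require.

```python
def embedding_lines(vocab, vectors, specials=()):
    """Return one 'word v1 v2 ...' line per token.

    Hypothetical helper: `vectors` holds one row per word in `vocab`;
    tokens in `specials` are prepended with all-zero vectors.
    """
    if len(vocab) != len(vectors):
        raise ValueError("need exactly one embedding row per vocabulary word")
    dim = len(vectors[0])
    lines = [token + " " + " ".join(["0.0"] * dim) for token in specials]
    for word, vector in zip(vocab, vectors):
        lines.append(word + " " + " ".join(str(float(x)) for x in vector))
    return lines


sentences = [["this", "is", "the", "first", "sentence"],
             ["yet", "another", "sentence"]]
# sorted() gives a reproducible order; list(set(...)) can differ between runs,
# which would misalign the words with previously computed embedding rows.
vocab = sorted({word for sentence in sentences for word in sentence})
```

The returned lines can be written out with `"\n".join(...)`. A NumPy matrix such as the `embed` array above can be passed as `vectors` directly, since the helper only iterates over its rows.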

About different model architecture

This PyTorch sample code is very useful and easy to understand.
Is it possible to add other model architectures, such as RNN or RCNN?
Thanks.

How to input raw text with command line interface for linear classifiers?

Hi,

When I checked the linear classifiers in the web documentation, I found an inconsistency between the usage of the CLI and the API.

Specifically, the CLI documentation says that the input should be an SVM file (with tf-idf features).

On the other hand, the API allows the input to be a TXT file (raw text).

I think the CLI could also support raw text input, but I am not sure how to do it.

Lack of license information

Hi,

After checking the contents of the repository, I did not find an explicit open source license declaration.

An explicit open source license is essential if you want to promote your work as an open source project, as the FAQ entry quoted below explains.

"Can I call my program "Open Source" even if I don't use an approved license?
Please don't do that. If you call it "Open Source" without using an approved license, you will confuse people. This is not merely a theoretical concern — we have seen this confusion happen in the past, and it's part of the reason we have a formal license approval process. See also our page on license proliferation for why this is a problem." [1]

[1] https://opensource.org/faq#avoid-unapproved-licenses

It would be great if you could add the license information in SPDX format.

Just my two cents.

SZ

CLI cannot save positive labels

Currently main.py only offers to save the k labels with the highest decision values, but doesn't have an option to save all labels with positive decision values.
Confusingly, there are options to compute F1 scores, which are metrics for the use cases where we care about all labels with positive decision values.

Perhaps having evaluate() take the model's output, instead of the model itself, would be a better and more general choice.

def evaluate(model, dataset_loader, monitor_metrics, label_key='label', silent=False):
    """Evaluate model and add the predictions to MultiLabelMetrics.
    Args:
        model (Model): a high level class used to initialize network, predict examples and load/save model
        dataset_loader (DataLoader): pytorch dataloader (torch.utils.data.DataLoader)
        monitor_metrics (list): metrics to monitor while validating
        label_key (str, optional): the key to label in the dataset. Defaults to 'label'.
    """
    progress_bar = tqdm(dataset_loader, disable=silent)
    eval_metric = MultiLabelMetrics(monitor_metrics=monitor_metrics)
    for batch in progress_bar:
        batch_labels = batch[label_key]
        predict_results = model.predict(batch)
        batch_label_scores = predict_results['scores']
        batch_labels = batch_labels.cpu().detach().numpy()
        batch_label_scores = batch_label_scores.cpu().detach().numpy()
        eval_metric.add_values(batch_labels, batch_label_scores)
    return eval_metric
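If the scores were available outside evaluate(), saving all positive labels would reduce to a threshold on each row of decision values. The following is a minimal sketch of that step. The helper name `positive_labels` and the default threshold of 0 are assumptions; for a network that outputs probabilities, a threshold such as 0.5 would play the same role.

```python
def positive_labels(scores, classes, threshold=0.0):
    """Return, per instance, the labels whose score exceeds `threshold`.

    Hypothetical helper: `scores` is a 2-D sequence (lists or a NumPy array)
    with one row per instance and one column per entry in `classes`.
    """
    return [
        [label for label, score in zip(classes, row) if score > threshold]
        for row in scores
    ]


predictions = positive_labels(
    [[0.3, -1.0, 2.0], [-0.1, -0.2, -0.3]],
    classes=["a", "b", "c"],
)
```

Unlike a top-k cut, the number of labels per instance varies here, and an instance may receive no labels at all, as the second row shows.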

Clarifying PR 135

In #135, the nn behaviour is

  1. ignore include_test_labels if label_file exists
  2. ignore include_test_labels if there is no test file
  3. default include_test_labels to false

Linear follows 1 and 3 but raises an error on 2, which I think might be an oversight.
The issue with 3 is that the docstring says it defaults to true.

The convention for no option given is currently

  1. invalid path strings for data paths, i.e. if os.path.exists(train_path):
  2. None for label_file, i.e. if label_file:

This is rather confusing. Should it be made consistent?

Linear in eval mode does not have multiclass

config.multiclass is only set in the else branch

if config.eval:
    preprocessor, model = linear.load_pipeline(config.checkpoint_path)
    datasets = preprocessor.load_data(
        config.training_file, config.test_file, config.eval)
else:
    preprocessor = linear.Preprocessor(data_format=config.data_format)
    datasets = preprocessor.load_data(
        config.training_file,
        config.test_file,
        config.eval,
        config.label_file,
        config.include_test_labels,
        config.remove_no_label_data)
    config.multiclass = is_multiclass_dataset(datasets['train'], label='y')

This is later used in linear_test:

def linear_test(config, model, datasets):
    metrics = linear.get_metrics(
        config.metric_threshold,
        config.monitor_metrics,
        datasets['test']['y'].shape[1],
        multiclass=config.multiclass

causing a missing attribute error.

As an addendum, I think config should never be modified after main.py. This was a huge source of confusion on the nn side previously.

how to generate vocab.csv and processed_full.embed in MIMIC-50?

I was trying to run the MIMIC-50 examples and provided the data in the same format as specified.
Apart from the training data files, there are two extra files: vocab.csv and processed_full.embed.
Neither the docs nor anything else I could find explains how to generate these files.

Could you share the script or the process for generating them?
Also, could you explain the difference between Macro-F1 and Another Macro-F1?
Thank you!

Suggested configuration for training NN models on highly imbalanced datasets

Hi!

I have a binary classification dataset with highly imbalanced label distributions (pos : neg == 1 : 200)

I was trying to apply the BERT code in Neural Network Quick Start Tutorial
directly on this dataset, with val metric set to "Macro-F1", but the trained model would mostly produce all negatives in this case.

I am wondering if there are parameters or configurations I could tune in LibMultiLabel for such an imbalanced dataset to improve the model's performance?

For your reference:

I also tried the linear method, where using train_cost_sensitive instead of train_1vsrest noticeably improved on this issue: with train_cost_sensitive, the model predicts 4 times more positive samples than with train_1vsrest. Both methods nevertheless have Micro-F1 and P@1 close to 0.99 (due to the dominating negative samples) and Macro-F1 around 0.5.

Thanks!

Uninformative model initialization error in parameter search

If network_config is misconfigured for search_params.py, this is the error that you get

  File "/home/user/miniconda3/envs/ray/lib/python3.8/site-packages/ray/tune/trainable/trainable.py", line 355, in train
    raise skipped from exception_cause(skipped)
  File "/home/user/miniconda3/envs/ray/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 325, in entrypoint
    return self._trainable_func(
  File "/home/user/miniconda3/envs/ray/lib/python3.8/site-packages/ray/tune/trainable/function_trainable.py", line 651, in _trainable_func
    output = fn()
  File "/home/user/miniconda3/envs/ray/lib/python3.8/site-packages/ray/tune/trainable/util.py", line 374, in _inner
    inner(config, checkpoint_dir=None)
  File "/home/user/miniconda3/envs/ray/lib/python3.8/site-packages/ray/tune/trainable/util.py", line 365, in inner
    trainable(config, **fn_kwargs)
  File "search_params.py", line 37, in train_libmultilabel_tune
    trainer = TorchTrainer(
  File "/home/user/workspace/LibMultiLabel/torch_trainer.py", line 69, in __init__
    self._setup_model(
  File "/home/user/workspace/LibMultiLabel/torch_trainer.py", line 154, in _setup_model
    self.model = init_model(
  File "/home/user/workspace/LibMultiLabel/libmultilabel/nn/nn_utils.py", line 95, in init_model
    raise AttributeError(f"Failed to initialize {model_name}.")
AttributeError: Failed to initialize KimCNN.

which gives you no clue what the problem is.
This is because of how the exception is being handled:

try:
    network = getattr(networks, model_name)(embed_vecs=embed_vecs, num_classes=len(classes), **dict(network_config))
except:
    raise AttributeError(f"Failed to initialize {model_name}.")

This does not re-raise the original exception; it raises a new one, and that is what the error output shows.
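A common remedy is to include the original message in the new exception and chain it with `raise ... from`, so the cause stays visible even when a framework reports only the final exception. The sketch below illustrates the pattern with stand-ins: `build` and `TinyModel` are hypothetical names, not part of LibMultiLabel.

```python
def build(model_class, **network_config):
    """Construct a model, keeping the underlying error visible on failure."""
    try:
        return model_class(**network_config)
    except Exception as error:
        # Repeat the original message and chain the cause explicitly.
        raise RuntimeError(
            f"Failed to initialize {model_class.__name__}: {error}"
        ) from error


class TinyModel:
    """Hypothetical stand-in for a network class."""

    def __init__(self, num_classes, dropout=0.2):
        self.num_classes = num_classes
        self.dropout = dropout
```

Calling `build(TinyModel, num_classes=3, droput=0.5)` with the misspelled keyword now raises a `RuntimeError` whose message names the unexpected argument, and whose `__cause__` is the original `TypeError`.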

Some minor issues

#85 regression, cannot finish training

After 66d8b8f, running python main.py --config example_config/rcv1/kim_cnn.yml fails with

Traceback (most recent call last):
  File "main.py", line 175, in <module>
    main()
  File "main.py", line 167, in main
    trainer.train()
  File "/home/julien/workspace/LibMultiLabel/torch_trainer.py", line 135, in train
    logging.info(f'Finished training. Load best model from {self.checkpoint_callback.best_model_path}')
AttributeError: 'TorchTrainer' object has no attribute 'checkpoint_callback'

This is because TorchTrainer._setup_trainer, which used to assign self.checkpoint_callback, was refactored into self.trainer = init_trainer(...). The callback is now a local variable inside init_trainer and is no longer stored on the TorchTrainer.

Previously it was

def _setup_trainer(self):
    """Setup torch trainer and callbacks."""
    self.checkpoint_callback = ModelCheckpoint(
        dirpath=self.checkpoint_dir, filename='best_model', save_last=True,
        save_top_k=1, monitor=self.config.val_metric, mode='max')

Now it is

self.trainer = init_trainer(checkpoint_dir=config.checkpoint_path,
                            epochs=config.epochs,
                            patience=config.patience,
                            val_metric=config.val_metric)

def init_trainer(checkpoint_dir,
                 epochs=10000,
                 patience=5,
                 mode='max',
                 val_metric='P@1',
                 silent=False,
                 use_cpu=False):
    """Initialize a torch lightning trainer.
    Args:
        checkpoint_dir (str): Directory for saving models and log.
        epochs (int): Number of epochs to train. Defaults to 10000.
        patience (int): Number of epochs to wait for improvement before early stopping. Defaults to 5.
        mode (str): One of [min, max]. Decides whether the val_metric is minimizing or maximizing.
        val_metric (str): The metric to monitor for early stopping. Defaults to 'P@1'.
        silent (bool): Enable silent mode. Defaults to False.
        use_cpu (bool): Disable CUDA. Defaults to False.
    Returns:
        pl.Trainer: A torch lightning trainer.
    """
    checkpoint_callback = ModelCheckpoint(
        dirpath=checkpoint_dir, filename='best_model', save_last=True,
        save_top_k=1, monitor=val_metric, mode=mode)
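One possible repair is for the factory to hand back the checkpoint callback together with the trainer, so the caller can keep a reference to it. The sketch below uses plain stand-in classes (`StubCheckpoint`, `StubTrainer`, `TorchTrainerSketch`) instead of PyTorch Lightning, and the tuple return value is an assumption, not the project's actual fix.

```python
from dataclasses import dataclass, field


@dataclass
class StubCheckpoint:
    """Stand-in for a checkpoint callback (not the Lightning class)."""
    dirpath: str
    monitor: str
    mode: str = "max"
    best_model_path: str = ""


@dataclass
class StubTrainer:
    """Stand-in for a trainer that only stores its callbacks."""
    callbacks: list = field(default_factory=list)


def init_trainer(checkpoint_dir, val_metric="P@1", mode="max"):
    """Build the trainer and also return the callback the caller needs later."""
    checkpoint_callback = StubCheckpoint(
        dirpath=checkpoint_dir, monitor=val_metric, mode=mode)
    trainer = StubTrainer(callbacks=[checkpoint_callback])
    return trainer, checkpoint_callback


class TorchTrainerSketch:
    """Hypothetical caller that keeps the attribute train() relies on."""

    def __init__(self, checkpoint_dir, val_metric):
        self.trainer, self.checkpoint_callback = init_trainer(
            checkpoint_dir, val_metric=val_metric)
```

With the reference stored again, a log line such as the one in train() can read `self.checkpoint_callback.best_model_path` without raising `AttributeError`.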
