
balancedlossnlp's People

Contributors

blessu


balancedlossnlp's Issues

class_freq

Thanks for sharing your excellent work!
What is the meaning of "class_freq"?
Does class_freq refer to the number of times each label appears in the training set? Is it a one-dimensional list?
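
To make the question concrete, this is how I would compute it under that reading (a sketch only; compute_class_freq and labels_ref are hypothetical names, not taken from the repository):

from collections import Counter

# Assumption: class_freq[i] is the number of training examples carrying label
# labels_ref[i]; both names here are hypothetical.
def compute_class_freq(train_labels, labels_ref):
    counts = Counter(label for labels in train_labels for label in labels)
    return [counts[label] for label in labels_ref]

# Example: two documents, three reference labels.
print(compute_class_freq([["earn"], ["earn", "grain"]], ["earn", "grain", "oil"]))
# -> [2, 1, 0]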

Versioning Problems

Python Version: 3.7.6
When I try to train the model with installing requirements.txt during training I am getting "packaging.version.InvalidVersion: Invalid version: '0.10.1,<0.11'". I fixed it by installing a newer version of huggingface -> 4.28.1 however at eval.py I am getting "No module named torch.fx" error with torch==1.7.0. torch.fx was released in newer versions so for inference do I need a different torch version?

Dataset split and performance results

Hi, thank you for your excellent research.

The first question is about how the dataset is split, taking Reuters-21578 as an example. According to the README.md, the training and test sets can be downloaded from a website such as Kaggle; in other words, the dataset is already separated into training and test sets in advance. Now, given another dataset with imbalanced target labels, how should I split it into training, validation, and test sets?

Furthermore, the code in dataset_prep.py further splits the training set into training and validation sets. I noted that the train_test_split function is used here, that is,
data_train, data_val = train_test_split(data_train_all, random_state=123, test_size=1000)

I just wonder why train_test_split is used here instead of iterative_train_test_split from http://scikit.ml/index.html, since this is a multi-label dataset.
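
For comparison, the stratified alternative I have in mind looks like this (a sketch, assuming features X and a binary label-indicator matrix y as numpy arrays, which is what scikit-multilearn expects):

import numpy as np
from skmultilearn.model_selection import iterative_train_test_split

# Toy data: 8 samples, 1 feature, 3 labels as a binary indicator matrix.
X = np.arange(8).reshape(-1, 1)
y = np.array([[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 0, 1],
              [0, 1, 0], [0, 1, 1], [1, 0, 0], [0, 0, 1]])

# Iterative stratification keeps each label's proportion similar across splits,
# which matters when some labels are rare.
X_train, y_train, X_val, y_val = iterative_train_test_split(X, y, test_size=0.25)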

The third question: could you explain in detail the performance results listed in Table 2 of your paper? Specifically, how are the Total miF/maF, Head (≥35) miF/maF, Med (8-35) miF/maF, and Tail (≤8) miF/maF computed? In particular, is the model trained once on the training set and then evaluated on the full test set as well as on the head, med, and tail label subsets? Is this right? My understanding is sketched below.
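
(Sketch of that understanding, assuming the groups are label subsets selected by training-set frequency and miF/maF are micro/macro F1 restricted to each subset; y_true, y_pred, and class_freq are hypothetical arrays, and the exact boundary handling at 8 and 35 is my guess.)

import numpy as np
from sklearn.metrics import f1_score

def group_f1(y_true, y_pred, class_freq):
    # y_true, y_pred: binary indicator matrices over all labels (hypothetical);
    # class_freq: training-set frequency per label, same column order.
    groups = {
        "head": class_freq >= 35,                       # Head (>=35)
        "med": (class_freq >= 8) & (class_freq < 35),   # Med (8-35); inclusivity assumed
        "tail": class_freq < 8,                         # Tail (<=8); inclusivity assumed
    }
    for name, mask in groups.items():
        miF = f1_score(y_true[:, mask], y_pred[:, mask], average="micro")
        maF = f1_score(y_true[:, mask], y_pred[:, mask], average="macro")
        print(name, round(miF, 4), round(maF, 4))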

thanks.

Map parameters in DB Loss

Firstly, thank you for the great paper and for providing the code. To use it in other applications/datasets, I'm trying to better understand the mapping parameters. Checking the Reuters and PubMed training code, I can see the following parameters for the first:

map_param=dict(alpha=0.1, beta=10.0, gamma=0.9)
And for the second:
map_param=dict(alpha=0.1, beta=10.0, gamma=0.05)
Have you done hyper-parameter optimisation to choose this gamma, or does it come from an "exact" approach?

For CB-NTR, we don't have this parameter (so all the parameters are equal for both datasets); therefore, it seems a "safer" loss to use on other datasets. Can you explain how these loss parameters were obtained? Thank you.
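
For reference, my current reading of what map_param controls, based on the smoothing function of Distribution-Balanced Loss rather than on this repository's code, so please correct me if it is wrong:

import numpy as np

# Assumed form of the DB-Loss weight mapping: a shifted, scaled sigmoid that
# smooths a raw rebalancing weight r into roughly [alpha, alpha + 1].
def map_weight(r, alpha=0.1, beta=10.0, gamma=0.9):
    return alpha + 1.0 / (1.0 + np.exp(-beta * (r - gamma)))

Under this reading, gamma shifts the sigmoid's midpoint, which would explain why it is the one parameter that changes per dataset (0.9 for Reuters vs. 0.05 for PubMed).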

Datasets Preparation

Thanks for sharing your excellent work!
I am a newcomer to the field of multi-label text classification.
I don't know where to download train_data.json and test_data.json for Reuters-21578, or data2020.json and data2021.json for PubMed-BioASQ.
These files are not included in the data downloaded from "https://www.kaggle.com/nltkdata/reuters".
Could you please provide the data used in the paper such as train_data.json, data_train.rand123 and labels_ref.rand123?
I really want to follow up on your work as soon as possible. Thank you very much!
