unitaryai / detoxify

Trained models & code to predict toxic comments on all 3 Jigsaw Toxic Comment Challenges. Built using ⚡ PyTorch Lightning and 🤗 Transformers. For access to our API, please email us at [email protected].

Home Page: https://www.unitary.ai/

License: Apache License 2.0

Python 100.00%
bert bert-model huggingface-transformers huggingface nlp toxic-comment-classification toxicity toxic-comments sentence-classification kaggle-competition

detoxify's Introduction

🙊 Detoxify

Toxic Comment Classification with ⚡ PyTorch Lightning and 🤗 Transformers

News & Updates

22-10-2021: New improved multilingual model & standardised class names

  • Updated the multilingual model weights used by Detoxify with a model trained on the translated data from the 2nd Jigsaw challenge (as well as the 1st). This model has also been trained to minimise bias and now returns the same categories as the unbiased model. New best AUC score on the test set: 92.11 (89.71 before).
  • All detoxify models now return consistent class names (e.g. "identity_attack" replaces "identity_hate" in the original model to match the unbiased classes).

03-09-2021: New improved unbiased model

  • Updated the unbiased model weights used by Detoxify with a model trained on both datasets from the first 2 Jigsaw challenges. New best score on the test set: 93.74 (93.64 before).

15-02-2021: Detoxify featured in Scientific American!

14-01-2021: Lightweight models

  • Added smaller models trained with Albert for the original and unbiased models! These can be accessed in the same way, by passing original-small or unbiased-small as the model name. The original-small achieved a mean AUC score of 98.28 (98.64 before) and the unbiased-small achieved a final score of 93.36 (93.64 before).

Description

Trained models & code to predict toxic comments on 3 Jigsaw challenges: Toxic comment classification, Unintended Bias in Toxic comments, Multilingual toxic comment classification.

Built by Laura Hanu at Unitary, where we are working to stop harmful content online by interpreting visual content in context.

Dependencies:

  • For inference:
    • 🤗 Transformers
    • ⚡ PyTorch Lightning
  • For training you will also need:
    • Kaggle API (to download data)

Challenge | Year | Goal | Original Data Source | Detoxify Model Name | Top Kaggle Leaderboard Score % | Detoxify Score %
Toxic Comment Classification Challenge | 2018 | build a multi-headed model that's capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate. | Wikipedia Comments | original | 98.86 | 98.64
Jigsaw Unintended Bias in Toxicity Classification | 2019 | build a model that recognizes toxicity and minimizes this type of unintended bias with respect to mentions of identities. You'll be using a dataset labeled for identity mentions and optimizing a metric designed to measure unintended bias. | Civil Comments | unbiased | 94.73 | 93.74
Jigsaw Multilingual Toxic Comment Classification | 2020 | build effective multilingual models | Wikipedia Comments + Civil Comments | multilingual | 95.36 | 92.11

It is also worth noting that the top leaderboard scores were achieved using model ensembles. The purpose of this library was to build something user-friendly and straightforward to use.

Multilingual model language breakdown

Language Subgroup | Subgroup size | Subgroup AUC Score %
🇮🇹 it | 8494 | 89.18
🇫🇷 fr | 10920 | 89.61
🇷🇺 ru | 10948 | 89.81
🇵🇹 pt | 11012 | 91.00
🇪🇸 es | 8438 | 92.74
🇹🇷 tr | 14000 | 97.19

Limitations and ethical considerations

If words that are associated with swearing, insults or profanity are present in a comment, it is likely that it will be classified as toxic, regardless of the tone or the intent of the author e.g. humorous/self-deprecating. This could present some biases towards already vulnerable minority groups.

The intended use of this library is for research purposes, for fine-tuning on carefully constructed datasets that reflect real-world demographics, and/or to help content moderators flag harmful content more quickly.

Some useful resources about the risk of different biases in toxicity or hate speech detection are:

Quick prediction

The multilingual model has been trained on 7 different languages, so it should only be tested on: English, French, Spanish, Italian, Portuguese, Turkish or Russian.

# install detoxify (run in a shell)

pip install detoxify

# then, in Python

from detoxify import Detoxify

# each model takes in either a string or a list of strings

results = Detoxify('original').predict('example text')

results = Detoxify('unbiased').predict(['example text 1','example text 2'])

input_text = ['example text','exemple de texte','texto de ejemplo','testo di esempio','texto de exemplo','örnek metin','пример текста']

results = Detoxify('multilingual').predict(input_text)

# to specify the device the model will be allocated on (defaults to cpu), accepts any torch.device input

model = Detoxify('original', device='cuda')

# optional to display results nicely (will need to pip install pandas)

import pandas as pd

print(pd.DataFrame(results, index=input_text).round(5))

For more details check the Prediction section.
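As a minimal sketch of consuming the output: assuming `predict` returns a dictionary mapping each class name to a score (a single float for a string input, a list of floats for a list input), the helper below picks the labels whose score passes a threshold. The name `flag_labels` and the 0.5 threshold are illustrative choices, not part of the library.

```python
def flag_labels(results, threshold=0.5):
    """Return, for each input text, the labels whose score is >= threshold.

    `results` is assumed to look like the output of Detoxify(...).predict:
    {label: score} for one text, or {label: [score, ...]} for several.
    """
    # Normalise the single-text case to one-element lists.
    as_lists = {
        label: list(scores) if isinstance(scores, (list, tuple)) else [scores]
        for label, scores in results.items()
    }
    n_texts = len(next(iter(as_lists.values()), []))
    return [
        [label for label, scores in as_lists.items() if scores[i] >= threshold]
        for i in range(n_texts)
    ]


# Hand-written scores standing in for real model output.
example = {"toxicity": [0.91, 0.02], "insult": [0.74, 0.01], "threat": [0.03, 0.00]}
print(flag_labels(example))
```

The right threshold depends on the application; a moderation queue might use a lower value than an automatic filter.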

Labels

All challenges have a toxicity label. The toxicity labels represent the aggregate ratings of up to 10 annotators according to the following schema:

  • Very Toxic (a very hateful, aggressive, or disrespectful comment that is very likely to make you leave a discussion or give up on sharing your perspective)
  • Toxic (a rude, disrespectful, or unreasonable comment that is somewhat likely to make you leave a discussion or give up on sharing your perspective)
  • Hard to Say
  • Not Toxic

More information about the labelling schema can be found here.

Toxic Comment Classification Challenge

This challenge includes the following labels:

  • toxic
  • severe_toxic
  • obscene
  • threat
  • insult
  • identity_hate
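Since the models now return standardised class names, code written against the original challenge labels may need a small translation step. The sketch below assumes the mapping implied by comparing this list with the unbiased labels (toxic to toxicity, severe_toxic to severe_toxicity, identity_hate to identity_attack); only the last pair is stated explicitly in the news entry, and `standardise_labels` is a hypothetical helper name.

```python
# Assumed mapping from the original challenge's label names to the standardised ones.
RENAMED = {
    "toxic": "toxicity",
    "severe_toxic": "severe_toxicity",
    "identity_hate": "identity_attack",
}


def standardise_labels(scores):
    """Return a copy of a {label: score} dict with old label names replaced."""
    return {RENAMED.get(label, label): value for label, value in scores.items()}


old_style = {"toxic": 0.8, "obscene": 0.1, "identity_hate": 0.05}
print(standardise_labels(old_style))
```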

Jigsaw Unintended Bias in Toxicity Classification

This challenge has 2 types of labels: the main toxicity labels and some additional identity labels that represent the identities mentioned in the comments.

Only identities with more than 500 examples in the test set (combined public and private) are included during training as additional labels and in the evaluation calculation.

  • toxicity
  • severe_toxicity
  • obscene
  • threat
  • insult
  • identity_attack
  • sexual_explicit

Identity labels used:

  • male
  • female
  • homosexual_gay_or_lesbian
  • christian
  • jewish
  • muslim
  • black
  • white
  • psychiatric_or_mental_illness

A complete list of all the identity labels available can be found here.

Jigsaw Multilingual Toxic Comment Classification

Since this challenge combines the data from the previous 2 challenges, it includes all labels from above. However, the final evaluation is only on:

  • toxicity

How to run

First, install dependencies

# clone project

git clone https://github.com/unitaryai/detoxify

# create virtual env

python3 -m venv toxic-env
source toxic-env/bin/activate

# install project
pip install -e detoxify

# or for training
pip install -e 'detoxify[dev]'

cd detoxify

Prediction

Trained models summary:

Model name Transformer type Data from
original bert-base-uncased Toxic Comment Classification Challenge
unbiased roberta-base Unintended Bias in Toxicity Classification
multilingual xlm-roberta-base Multilingual Toxic Comment Classification

For a quick prediction, you can run the example script on a comment directly or on a txt file containing a list of comments.

# load model via torch.hub

python run_prediction.py --input 'example' --model_name original

# load model from checkpoint path

python run_prediction.py --input 'example' --from_ckpt_path model_path

# save results to a .csv file

python run_prediction.py --input test_set.txt --model_name original --save_to results.csv

# to see usage

python run_prediction.py --help

Checkpoints can be downloaded from the latest release or via the Pytorch hub API with the following names:

  • toxic_bert
  • unbiased_toxic_roberta
  • multilingual_toxic_xlm_r

import torch

model = torch.hub.load('unitaryai/detoxify','toxic_bert')

Importing detoxify in python:

from detoxify import Detoxify

results = Detoxify('original').predict('some text')

results = Detoxify('unbiased').predict(['example text 1','example text 2'])

input_text = ['example text','exemple de texte','texto de ejemplo','testo di esempio','texto de exemplo','örnek metin','пример текста']

results = Detoxify('multilingual').predict(input_text)

# to display results nicely

import pandas as pd

print(pd.DataFrame(results, index=input_text).round(5))

Training

If you do not already have a Kaggle account:

  • you need to create one to be able to download the data

  • go to My Account and click on Create New API Token - this will download a kaggle.json file

  • make sure this file is located in ~/.kaggle
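Before running the download commands, it can help to confirm that the token is where the Kaggle client expects it. This is a small sketch; `find_kaggle_token` is a hypothetical helper, and the default location is the `~/.kaggle` directory named in the steps above.

```python
from pathlib import Path


def find_kaggle_token(kaggle_dir=None):
    """Return the path to kaggle.json, or raise if it is missing."""
    # Default to ~/.kaggle, the location named in the steps above.
    kaggle_dir = Path(kaggle_dir) if kaggle_dir else Path.home() / ".kaggle"
    token = kaggle_dir / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected an API token at {token}")
    return token


# Usage (raises FileNotFoundError if the token has not been placed yet):
# print(find_kaggle_token())
```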

# create data directory

mkdir jigsaw_data
cd jigsaw_data

# download data

kaggle competitions download -c jigsaw-toxic-comment-classification-challenge

kaggle competitions download -c jigsaw-unintended-bias-in-toxicity-classification

kaggle competitions download -c jigsaw-multilingual-toxic-comment-classification

Start Training

Toxic Comment Classification Challenge

# combine test.csv and test_labels.csv
python preprocessing_utils.py --test_csv jigsaw_data/jigsaw-toxic-comment-classification-challenge/test.csv --update_test

python train.py --config configs/Toxic_comment_classification_BERT.json

Unintended Bias in Toxicity Challenge

python train.py --config configs/Unintended_bias_toxic_comment_classification_RoBERTa_combined.json

Multilingual Toxic Comment Classification

The translated data (source 1, source 2) can be downloaded from Kaggle in French, Spanish, Italian, Portuguese, Turkish, and Russian (the languages available in the test set).

# combine test.csv and test_labels.csv
python preprocessing_utils.py --test_csv jigsaw_data/jigsaw-multilingual-toxic-comment-classification/test.csv --update_test

python train.py --config configs/Multilingual_toxic_comment_classification_XLMR.json

Monitor progress with tensorboard

tensorboard --logdir=./saved

Model Evaluation

Toxic Comment Classification Challenge

This challenge is evaluated on the mean AUC score of all the labels.

python evaluate.py --checkpoint saved/lightning_logs/checkpoints/example_checkpoint.pth --test_csv test.csv
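To make "mean AUC score of all the labels" concrete, here is a dependency-free sketch that computes the ROC AUC per label and averages the results. It is for illustration only and is not the code in evaluate.py; the pairwise loop is quadratic, so a library routine such as scikit-learn's `roc_auc_score` is the better choice on real data. The toy labels and scores are made up.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_column_auc(labels, scores):
    """Average the per-label AUCs; both arguments map label name -> list of values."""
    aucs = {name: roc_auc(labels[name], scores[name]) for name in labels}
    return sum(aucs.values()) / len(aucs), aucs


# Toy data: four comments, two labels.
labels = {"toxic": [1, 0, 1, 0], "insult": [0, 0, 1, 1]}
scores = {"toxic": [0.9, 0.2, 0.6, 0.7], "insult": [0.1, 0.4, 0.35, 0.8]}
mean_auc, per_label = mean_column_auc(labels, scores)
print(mean_auc, per_label)
```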

Unintended Bias in Toxicity Challenge

This challenge is evaluated on a novel bias metric that combines different AUC scores to balance overall performance. More information on this metric here.

python evaluate.py --checkpoint saved/lightning_logs/checkpoints/example_checkpoint.pth --test_csv test.csv

# to get the final bias metric
python model_eval/compute_bias_metric.py
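The way the bias metric "combines different AUC scores" can be sketched as follows. This assumes the definition published for the Kaggle competition: three per-identity AUC variants (subgroup, BPSN, BNSP) are each reduced with a power mean of exponent -5, then combined with the overall AUC using four equal weights of 0.25. The function names are illustrative, and model_eval/compute_bias_metric.py may differ in detail.

```python
def power_mean(values, p=-5):
    """Generalised mean; a negative p pulls the result towards the worst value."""
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)


def final_bias_score(overall_auc, subgroup_aucs, bpsn_aucs, bnsp_aucs, p=-5, weight=0.25):
    """Combine the overall AUC with the power means of the three per-identity AUC lists."""
    bias_terms = [power_mean(aucs, p) for aucs in (subgroup_aucs, bpsn_aucs, bnsp_aucs)]
    return weight * overall_auc + weight * sum(bias_terms)


# Made-up per-identity AUCs, one entry per identity group.
print(final_bias_score(0.95, [0.90, 0.85], [0.92, 0.88], [0.94, 0.91]))
```

Because the exponent is negative, a single identity group with a low AUC drags the score down more than an arithmetic mean would, which is the point of the metric.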

Multilingual Toxic Comment Classification

This challenge is evaluated on the AUC score of the main toxic label.

python evaluate.py --checkpoint saved/lightning_logs/checkpoints/example_checkpoint.pth --test_csv test.csv

Citation

@misc{Detoxify,
  title={Detoxify},
  author={Hanu, Laura and {Unitary team}},
  howpublished={Github. https://github.com/unitaryai/detoxify},
  year={2020}
}

detoxify's People

Contributors

anitavero, borda, dcferreira, dependabot[bot], gregpriday, jamt9000, laurahanu, omidforoqi, pre-commit-ci[bot], s2t2, vela-zz

detoxify's Issues

Error during training

I tried to start the training for Toxic Comment Classification Challenge with the code provided in the documentation:

# combine test.csv and test_labels.csv
python preprocessing_utils.py --test_csv jigsaw_data/jigsaw-toxic-comment-classification-challenge/test.csv --update_test

python train.py --config configs/Toxic_comment_classification_BERT.json

However, it returns the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'jigsaw_data/jigsaw-toxic-comment-classification-challenge/val.csv'

I saw that only training and test datasets are present among the data. Should I use the test by changing the configuration file?
( I have downloaded the datasets from the following link: https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data?select=train.csv.zip )

Thanks in advance
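One possible workaround, not confirmed by the maintainers, is to carve a validation file out of train.csv and point the config at the resulting files. The sketch below uses only the standard library; `split_train_val`, the 10% fraction, and the output file names are assumptions.

```python
import csv
import random


def split_train_val(train_csv, out_train_csv, out_val_csv, val_fraction=0.1, seed=42):
    """Write a random val_fraction of the rows to out_val_csv and the rest to out_train_csv."""
    # newline="" lets the csv module handle comments that contain line breaks.
    with open(train_csv, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows = list(reader)

    random.Random(seed).shuffle(rows)
    n_val = max(1, int(len(rows) * val_fraction))

    for path, subset in ((out_val_csv, rows[:n_val]), (out_train_csv, rows[n_val:])):
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(header)
            writer.writerows(subset)
    return len(rows) - n_val, n_val


# Usage (hypothetical output names; the original train.csv is left untouched):
# split_train_val("train.csv", "train_split.csv", "val.csv")
```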

Detoxify Roadmap

Issue to keep track of improvements we'd like to make (in no particular order). Feedback and suggestions welcome!

  • better way to handle emojis (#27)
  • train the unbiased model on the Wikipedia dataset from the first challenge as well
  • add a multilingual light model (#17)
  • train the multilingual model with more languages
  • add more datasets to training
  • add new categories like personal attack (potentially using https://github.com/ewulczyn/wiki-detox)
  • improve bias metrics & test on different benchmark like HateCheck

The prediction is too slow (about ~3s/ text)

Hi,
Firstly, thank you guys for this repo. It's so helpful for us.
I just used it for predicting texts in my dataset, but the speed is too slow. About 3 seconds per text. I wonder if it could be faster?

Looking forward to hearing from you soon.
Regards,
Luan

Weird behavior of Smaller and Larger Models for same Text

Hey! Thanks for this easy-to-get-started package. I was testing both the original and unbiased models on the following sentences:

doc_1 = "I don't know why people don't support Muslims and call them terrorists often. They are not."
doc_2 = "There is nothing wrong being in a lesbian. Everyone has feelings."

These are the toxicity scores they produced:

[Image: model_testing]

The original model, which is supposed to be biased, predicts doc_1 to be non-toxic, as it should, while the unbiased-small model predicts it to be toxic.

Likewise, for doc_2, the prediction should ideally be non-toxic, and the original model (both smaller and larger), being biased, should predict it as toxic. This is what it does:

[Image: model_testing_2]

The smaller original model predicts toxic while the larger one does not. Can you explain what might be causing the different behavior for the same text between the smaller and larger models, for both the original and unbiased models?

Problem launching the model

Hello,
first of all thanks for your work. This model would be amazing to use to help my company with comment moderation across social networks.
I followed your guide, but when I launch this to test the model:
python run_prediction.py --input 'example' --model_name original
I get this error:
RuntimeError: Only one file(not dir) is allowed in the zipfile

TypeError: expected str, bytes or os.PathLike object, not NoneType

Hey, I'm trying to use detoxify to predict, but I am getting the following error when I try to load the model (model = torch.hub.load('unitaryai/detoxify','toxic_bert')):

Downloading: "https://github.com/unitaryai/detoxify/archive/master.zip" to /root/.cache/torch/hub/master.zip

---------------------------------------------------------------------------

AttributeError                            Traceback (most recent call last)

/usr/local/lib/python3.7/dist-packages/torch/serialization.py in _check_seekable(f)
    307     try:
--> 308         f.seek(f.tell())
    309         return True

AttributeError: 'NoneType' object has no attribute 'seek'


During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)

14 frames

/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py in load_state_dict(checkpoint_file)
    348     try:
--> 349         return torch.load(checkpoint_file, map_location="cpu")
    350     except Exception as e:

/usr/local/lib/python3.7/dist-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
    593 
--> 594     with _open_file_like(f, 'rb') as opened_file:
    595         if _is_zipfile(opened_file):

/usr/local/lib/python3.7/dist-packages/torch/serialization.py in _open_file_like(name_or_buffer, mode)
    234         elif 'r' in mode:
--> 235             return _open_buffer_reader(name_or_buffer)
    236         else:

/usr/local/lib/python3.7/dist-packages/torch/serialization.py in __init__(self, buffer)
    219         super(_open_buffer_reader, self).__init__(buffer)
--> 220         _check_seekable(buffer)
    221 

/usr/local/lib/python3.7/dist-packages/torch/serialization.py in _check_seekable(f)
    310     except (io.UnsupportedOperation, AttributeError) as e:
--> 311         raise_err_msg(["seek", "tell"], e)
    312     return False

/usr/local/lib/python3.7/dist-packages/torch/serialization.py in raise_err_msg(patterns, e)
    303                                 + " try to load from it instead.")
--> 304                 raise type(e)(msg)
    305         raise e

AttributeError: 'NoneType' object has no attribute 'seek'. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.


During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)

<ipython-input-12-ab26c4c96f7d> in <module>()
----> 1 model = torch.hub.load('unitaryai/detoxify','toxic_bert')

/usr/local/lib/python3.7/dist-packages/torch/hub.py in load(repo_or_dir, model, source, force_reload, verbose, skip_validation, *args, **kwargs)
    397         repo_or_dir = _get_cache_or_reload(repo_or_dir, force_reload, verbose, skip_validation)
    398 
--> 399     model = _load_local(repo_or_dir, model, *args, **kwargs)
    400     return model
    401 

/usr/local/lib/python3.7/dist-packages/torch/hub.py in _load_local(hubconf_dir, model, *args, **kwargs)
    426 
    427     entry = _load_entry_from_hubconf(hub_module, model)
--> 428     model = entry(*args, **kwargs)
    429 
    430     sys.path.remove(hubconf_dir)

/content/detoxify/detoxify/detoxify.py in toxic_bert()
    125 
    126 def toxic_bert():
--> 127     return load_model("original")
    128 
    129 

/content/detoxify/detoxify/detoxify.py in load_model(model_type, checkpoint)
     65 def load_model(model_type, checkpoint=None):
     66     if checkpoint is None:
---> 67         model, _, _ = load_checkpoint(model_type=model_type)
     68     else:
     69         model, _, _ = load_checkpoint(checkpoint=checkpoint)

/content/detoxify/detoxify/detoxify.py in load_checkpoint(model_type, checkpoint, device, huggingface_config_path)
     57         **loaded["config"]["arch"]["args"],
     58         state_dict=loaded["state_dict"],
---> 59         huggingface_config_path=huggingface_config_path,
     60     )
     61 

/content/detoxify/detoxify/detoxify.py in get_model_and_tokenizer(model_type, model_name, tokenizer_name, num_classes, state_dict, huggingface_config_path)
     23         num_labels=num_classes,
     24         state_dict=state_dict,
---> 25         local_files_only=huggingface_config_path is not None,
     26     )
     27     tokenizer = getattr(transformers, tokenizer_name).from_pretrained(

/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   1795             if not is_sharded:
   1796                 # Time to load the checkpoint
-> 1797                 state_dict = load_state_dict(resolved_archive_file)
   1798             # set dtype to instantiate the model under:
   1799             # 1. If torch_dtype is not None, we use that dtype

/usr/local/lib/python3.7/dist-packages/transformers/modeling_utils.py in load_state_dict(checkpoint_file)
    350     except Exception as e:
    351         try:
--> 352             with open(checkpoint_file) as f:
    353                 if f.read().startswith("version"):
    354                     raise OSError(

TypeError: expected str, bytes or os.PathLike object, not NoneType

I'm not sure what to do here.

Small models don't load on CPU-only machines

It looks like they were serialised with GPU tensors

model = Detoxify('original-small')
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False.
If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu')
to map your storages to the CPU.

and so the map_location should be added when loading.

Unable to properly load state_dict

Code is very basic. Just two lines on Google Colab after pip install detoxify

It states that state_dict is NoneType and thus cannot load the model. Unsure how to fix.

[Image: Screen Shot 2022-04-06 at 1 29 41 PM]

Memory leak running on lightweight model

First of all, this repository has been a great help for my research effort, and I really appreciate you sharing this to the public.

My issue is that the program does not free its memory after it is done with the calculation. Here is the code to replicate the issue:

from detoxify import Detoxify
import torch
Detoxify('original-small').predict("Beep, beep, I'm a sheep.")
torch.cuda.empty_cache()

When monitoring GPU memory use, we find that the memory is not freed (about 600 MB more than before). This could be a problem when running the model on a large amount of data or with multithreading. I have been trying to multithread the model, only to find that it always runs out of memory on larger datasets (each thread runs the model only once and then joins; 5 threads are used, but after 500 runs it exceeds 8 GB of GPU memory). The memory is only freed after the kernel is completely destroyed. I'm using Windows 10, Anaconda 1.10, Python 3.8.3, PyTorch 1.7 w/ CUDA 11.0.

I'm currently looking into ways to solve this, would you be able to help?

PS: The same issue could be present in the original model; maybe GPU and CPU computing use different garbage collection methods.

Detoxify on AWS Lambda

Hi Team,
I have been trying to run my code using the Detoxify library in an AWS Lambda function.
For this, I download the 'whl' file of the library and zip it into a Lambda layer for use with the Lambda function, while also ensuring that detoxify is installed on my local system.
This process has worked with other Python libraries, but it does not work with the Detoxify library.
Kindly let me know the reasons, or any suggestions to get it working.

Regards,
Parth Sharma

How do you load a custom checkpoint?

Hello I want to train the network on my own samples but I'm finding it quite difficult.

Right now I edited Toxic_comment_classification_BERT.json to point to my own training and test csv. Then I have to edit train.py to manually save the model object inside ToxicClassifier at the end of the training.

torch.save(model.model, 'custom.pt')

Then I have to load the file manually, instantiate a normal instance of detoxify, and then replace the internal model object with the saved version to get it to work.

saved = torch.load('custom.pt')
d = detoxify.Detoxify('original')
d.model = saved

If I try to load a checkpoint generated at "saved\Jigsaw_BERT\lightning_logs\version_x\checkpoints\epoch=3-step=76.ckpt" with detoxify or try to instantiate detoxify with the "checkpoint parameter" or with a file generated by torch.save(model), it always says

Checkpoint needs to contain the config it was trained with as well as the state dict

What's the proper way of saving the checkpoint so it has the config and state dict with it? Or is my workaround the best way to use custom training data?

Add dutch language

Hi!
This is awesome!
Can you maybe add the dutch language?
Thanks,
Joachim.

Any suggestions to handle longer text?

I'm trying to do predictions with the pre-trained model and I keep running into this issue:

Token indices sequence length is longer than the specified maximum sequence length for this model (1142 > 512). Running this sequence through the model will result in indexing errors
*** RuntimeError: The size of tensor a (1142) must match the size of tensor b (512) at non-singleton dimension 1

This happens when I try to predict on a text that is longer than 512 tokens. I understand this is because the string is too long. Other than truncating the string, are there any suggestions on how to deal with this problem with the package?

Thank you
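One common way to handle inputs beyond the 512-token limit, sketched here under stated assumptions, is to split the text into overlapping windows, score each window, and keep the highest score per label. This is not a feature of the package. Word counts are only a rough proxy for token counts, so `max_words=200` is a conservative guess rather than a guarantee, and taking the maximum is a design choice (a mean would dilute a single toxic passage in a long text). `predict` is any callable that behaves like `Detoxify(...).predict` on a list of strings.

```python
def split_into_chunks(text, max_words=200, overlap=20):
    """Split text into overlapping windows of at most max_words words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, max(len(words), 1), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the current window already reaches the end of the text
    return chunks


def predict_long_text(predict, text, max_words=200, overlap=20):
    """Score each chunk with predict(list_of_texts) and keep the maximum per label."""
    chunks = split_into_chunks(text, max_words, overlap)
    per_chunk = predict(chunks)  # assumed shape: {label: [score per chunk]}
    return {label: max(scores) for label, scores in per_chunk.items()}


# Usage (requires detoxify to be installed):
# from detoxify import Detoxify
# scores = predict_long_text(Detoxify('original').predict, long_text)
```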

GPU not used during Training

Hi there,
Thank you for the useful repository.

I am trying to use the script for model training. Leaving the "--device" parameter at its default value (default: all) should use the GPU if available, right?

However, even though one is available, it does not seem to be used.
It prints as output: GPU available: True, used: False, and the training script takes 48 hours.

Could you help me with how to make use of GPU?

Thanks in advance

allow full offline execution

There are some applications such as Kaggle which require running without an internet connection.
At the moment the package can be downloaded along with the checkpoints, but creating the model still requires pulling details from the HF hub, so it would be good to be able to download those details in advance and use them instead of the online source.

I guess that the solution would be exposing this argument in Module init:

pretrained_model_name_or_path=None,

Multilingual light model

Hi guys! Nice repo!

I'm deploying an app to detect hate tweets on Twitter as part of my data science master's, and it works perfectly locally.

As I live in Spain, the app's main target is Spanish accounts, so I am developing the app on the multilingual model. The problem I am facing now is deployment to a service like Streamlit Sharing or Heroku: I can't finish the deployment due to host size limits.

I've seen that you have developed light models for original and unbiased, but not for multilingual. Do you expect to release a light multilingual model soon? If not, have you come across any workaround for the Streamlit Sharing (800 MB) or Heroku (500 MB) size limits?

Thank you so much!

Unbiased model not returning identity labels

Thanks for the great repo!

I'm running the 'Quick prediction' code using the unbiased model, but there are no identity labels being returned - even with severe toxicity. I only get the toxicity labels.

Am I missing something?

Thanks again!

Installing on Heroku

Hey, thanks for this great package.

I have detoxify in my "requirements.txt" file, and it works great locally.

But when I push a Heroku server, it raises this error: "Compiled slug size: 1.1G is too large (max is 500M)" when trying to install and compress the packages.

Some research indicates this error is sometimes caused by the large size of the tensorflow dependency, and it seems we can get past this by installing tensorflow-cpu instead of tensorflow. But I'm not sure whether tensorflow is even a dependency.

I was wondering if you have any ideas or suggestions as to how I could get your package to work on a Heroku server.

Thanks.

Detoxify pip .whl installs files in other locations

pip install detoxify should just install detoxify to the python site-packages/

Instead it is also creating the folders src and tests which could break unrelated packages

Uninstalling detoxify-0.2.0:
  Would remove:
    /Users/jamesthewlis/miniconda3/envs/detoxify2/lib/python3.6/site-packages/detoxify-0.2.0.dist-info/*
    /Users/jamesthewlis/miniconda3/envs/detoxify2/lib/python3.6/site-packages/detoxify/*
    /Users/jamesthewlis/miniconda3/envs/detoxify2/lib/python3.6/site-packages/src/*
    /Users/jamesthewlis/miniconda3/envs/detoxify2/lib/python3.6/site-packages/tests/*
unzip -l detoxify-0.2.0-py3-none-any.whl
Archive:  detoxify-0.2.0-py3-none-any.whl
  Length      Date    Time    Name
---------  ---------- -----   ----
      225  11-09-2020 11:07   detoxify/__init__.py
     4184  12-15-2020 21:19   detoxify/detoxify.py
        0  11-09-2020 11:07   src/__init__.py
     8041  11-09-2020 11:07   src/data_loaders.py
      960  11-09-2020 11:07   src/utils.py
        0  11-09-2020 11:07   tests/__init__.py
     2031  11-09-2020 11:07   tests/test_trainer.py
    11357  12-16-2020 09:51   detoxify-0.2.0.dist-info/LICENSE
    11824  12-16-2020 09:51   detoxify-0.2.0.dist-info/METADATA
       92  12-16-2020 09:51   detoxify-0.2.0.dist-info/WHEEL
        9  12-16-2020 09:51   detoxify-0.2.0.dist-info/top_level.txt
      907  12-16-2020 09:51   detoxify-0.2.0.dist-info/RECORD
---------                     -------
    39630                     12 files

Question regarding training with other models

Hello, I am a relatively new user of NLP and I am currently working on a project where I need to use the output of my model as input to your model, which should be applied sequentially.

During the training process, I need to pass the loss through your model without updating any of its weights, and only update the weights of my model. I have a question regarding the training process: Should I follow the steps outlined in the "Training" section of your documentation, or can I use the code provided in the "Prediction" section directly?

I would greatly appreciate your help and guidance on this matter. Thank you in advance.
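Keeping a pretrained model's weights fixed while still letting gradients pass through it is usually done in PyTorch by disabling `requires_grad` on its parameters. The sketch below shows that general technique and is not an answer from the maintainers; `freeze` is a hypothetical helper, and it assumes the wrapped network is reachable as the `.model` attribute of a `Detoxify` instance. Note that gradients can only reach an upstream model if its output arrives in a differentiable form: decoding to text and re-tokenising breaks the gradient path.

```python
def freeze(module):
    """Put a torch.nn.Module in eval mode and stop its weights from receiving updates."""
    module.eval()  # disables dropout, so the frozen model behaves deterministically
    for param in module.parameters():
        param.requires_grad_(False)  # weights stay fixed; gradients still flow to earlier layers
    return module


# Usage (requires torch and detoxify to be installed):
# from detoxify import Detoxify
# scorer = freeze(Detoxify('original').model)
# optimizer = torch.optim.Adam(my_model.parameters())  # only the upstream model is updated
```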

Batch prediction on a very large text file?

Hey guys great repo, I played with your model and works very well on random real world data. I'd like to apply inference on a test file with 2 million lines. How can I do batch prediction with the 'multilingual' model since I couldn't fit the data in a 16GB GPU.
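A file of this size does not need to fit in memory at once: it can be streamed in fixed-size batches through a single model instance. The sketch below assumes one comment per line and a `predict` callable that behaves like `Detoxify(...).predict` on a list of strings; `iter_batches`, `predict_file`, and the batch size of 64 are illustrative choices to tune against available GPU memory.

```python
from itertools import islice


def iter_batches(lines, batch_size):
    """Yield lists of up to batch_size stripped, non-empty lines."""
    cleaned = (line.strip() for line in lines)
    non_empty = (line for line in cleaned if line)
    while True:
        batch = list(islice(non_empty, batch_size))
        if not batch:
            return
        yield batch


def predict_file(predict, path, batch_size=64):
    """Stream a text file through predict(list_of_texts), one batch at a time."""
    with open(path, encoding="utf-8") as f:
        for batch in iter_batches(f, batch_size):
            scores = predict(batch)  # assumed shape: {label: [score per text]}
            for i, text in enumerate(batch):
                yield text, {label: values[i] for label, values in scores.items()}


# Usage (requires detoxify; the model is created once and reused for every batch):
# from detoxify import Detoxify
# model = Detoxify('multilingual', device='cuda')
# for text, scores in predict_file(model.predict, 'test_set.txt'):
#     ...
```

Creating the model once outside the loop also avoids repeating the model loading cost on every call.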

Number of epochs to get the best model

Hello,

I wanted to reproduce the results of the models and was wondering how many epochs each model had to be trained for to get the scores shown.

Thank you!

Feature: Add lightweight models

Motivation

Currently this library only uses transformer models >= 418mb in size. Would be helpful to add functionality for lighter language models, such as Albert or a small Roberta, which would be more efficient for practical applications.

Implementation

Add a lightweight version for each toxic model e.g original-small, unbiased-small.

How to get the model?

How can I get the model and use it in JavaScript, like the TensorFlow toxicity model?

Toxicity scores, same as Perspective API?

Great repo!

I have a question, I hope someone can help?

Are the toxicity scores provided by the Unitary models probability scores, in the same way that the Perspective API returns these values?

"The only score type currently offered is a probability score. It indicates how likely it is that a reader would perceive the comment provided in the request as containing the given attribute. For each attribute, the scores provided represent a probability, with a value between 0 and 1. A higher score indicates a greater likelihood that a reader would perceive the comment as containing the given attribute. For example, a comment like “You are an idiot” may receive a probability score of 0.8 for attribute TOXICITY, indicating that 8 out of 10 people would perceive that comment as toxic."

Or do they represent the extent of the toxicity?

Thanks so much!

Getting got_ver is None error when importing

I re-installed detoxify to get the latest version, and now I get the following error when I try to import detoxify. This seems to be related to the transformers and torch dependencies; it looks like the latest versions are incompatible.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\ProgramData\Anaconda2\envs\py36\lib\site-packages\detoxify\__init__.py", line 1, in <module>
    from .detoxify import (
  File "C:\ProgramData\Anaconda2\envs\py36\lib\site-packages\detoxify\detoxify.py", line 2, in <module>
    import transformers
  File "C:\ProgramData\Anaconda2\envs\py36\lib\site-packages\transformers\__init__.py", line 43, in <module>
    from . import dependency_versions_check
  File "C:\ProgramData\Anaconda2\envs\py36\lib\site-packages\transformers\dependency_versions_check.py", line 41, in <module>
    require_version_core(deps[pkg])
  File "C:\ProgramData\Anaconda2\envs\py36\lib\site-packages\transformers\utils\versions.py", line 120, in require_version_core
    return require_version(requirement, hint)
  File "C:\ProgramData\Anaconda2\envs\py36\lib\site-packages\transformers\utils\versions.py", line 114, in require_version
    _compare_versions(op, got_ver, want_ver, requirement, pkg, hint)
  File "C:\ProgramData\Anaconda2\envs\py36\lib\site-packages\transformers\utils\versions.py", line 45, in _compare_versions
    raise ValueError("got_ver is None")
ValueError: got_ver is None

The multilingual CSVs are missing from Kaggle

The various CSVs from Jigsaw Multilingual Toxic Comment Classification appear to no longer be available. These are:

jigsaw_data/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train-google-es-cleaned.csv
jigsaw_data/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train-google-fr-cleaned.csv
jigsaw_data/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train-google-it-cleaned.csv
jigsaw_data/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train-google-pt-cleaned.csv
jigsaw_data/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train-google-ru-cleaned.csv
jigsaw_data/jigsaw-multilingual-toxic-comment-classification/jigsaw-toxic-comment-train-google-tr-cleaned.csv
jigsaw_data/jigsaw-unintended-bias-in-toxicity-classification/jigsaw-unintended-bias-train_es_clean.csv
jigsaw_data/jigsaw-unintended-bias-in-toxicity-classification/jigsaw-unintended-bias-train_fr_clean.csv
jigsaw_data/jigsaw-unintended-bias-in-toxicity-classification/jigsaw-unintended-bias-train_it_clean.csv
jigsaw_data/jigsaw-unintended-bias-in-toxicity-classification/jigsaw-unintended-bias-train_pt_clean.csv
jigsaw_data/jigsaw-unintended-bias-in-toxicity-classification/jigsaw-unintended-bias-train_ru_clean.csv
jigsaw_data/jigsaw-unintended-bias-in-toxicity-classification/jigsaw-unintended-bias-train_tr_clean.csv

I've been able to find some of these at https://www.kaggle.com/miklgr500/jigsaw-train-multilingual-coments-google-api but these do not include the bias CSVs.

Do you happen to know where these are located?

Thank you

Other labels availability on multilingual models

Hi unitary!

Thank you for this fantastic project.

I was wondering how we could easily get access to the other labels in the datasets like severe_toxicity or identity_attack on the multilingual model, would you have any recommendations on how to achieve that?

Best,
marsouin

RuntimeError: Only one file(not dir) is allowed in the zipfile

This error happens when I try executing the code below. I'm using Anaconda (run as admin) and Python 3.6+

from detoxify import Detoxify

results = Detoxify('original').predict('example text')

RuntimeError: Only one file(not dir) is allowed in the zipfile

How to overcome memory issues when predicting large batches of data?

Hello team,

I have a dataset of about 8000 comments. Each comment is around 6 to 8 words long (some are shorter, with only 2 words).

The problem is that I am unable to get the prediction since I run out of GPU memory during the process. To overcome this I am using a custom loop to loop over comments in batches and append the results to a data frame.

comments_list = comments["text"].to_list()
df = pd.DataFrame()

for i in range(0, len(comments_list), 32):
    comms = comments_list[i : i + 32]
    results = Detoxify("original", device=device).predict(comms)
    results = pd.DataFrame(results)
    df = df.append(results, ignore_index=True)

Is there a more efficient way of doing this than writing a for loop?

Currently I have a 16GB Tesla T4 as GPU.

Thanks!
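
One possible improvement, sketched under assumptions: the loop above builds a new `Detoxify` model on every iteration, so loading it once and reusing it should avoid repeated checkpoint loading. The helper `predict_in_batches` is hypothetical. It assumes `predict_fn` behaves like `Detoxify.predict` on a list, returning a dict that maps each label to a list of scores.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def predict_in_batches(
    predict_fn: Callable[[List[str]], Dict[str, list]],
    texts: Sequence[str],
    batch_size: int = 32,
) -> Dict[str, list]:
    """Call `predict_fn` on consecutive slices of `texts` and merge the scores."""
    merged: Dict[str, list] = defaultdict(list)
    for start in range(0, len(texts), batch_size):
        batch = list(texts[start:start + batch_size])
        # Each call returns {label: [score per text in the batch]}.
        for label, scores in predict_fn(batch).items():
            merged[label].extend(scores)
    return dict(merged)
```

With Detoxify this might be used as `model = Detoxify("original", device=device)` followed by `pd.DataFrame(predict_in_batches(model.predict, comments_list))`, which also builds the data frame once instead of appending to it per batch.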

Detoxify doesn't work well on Emojis

Currently all detoxify models seem to not recognize emojis that are meant to be toxic/hateful in context or on their own (#26). While the Bert tokenizer returns the same output for different emojis, Roberta-based tokenizers seem to differentiate between different emoji inputs.

Some potential solutions:

  • replacement method (fast): use an emoji library (e.g. demoji) and replace current emojis with their text description (i.e. 🖕 -> 'middle finger'). While this would work in some cases (when emojis are used with their literal meaning), there will be some cases where the description wouldn't make the intended meaning clearer e.g. drugs or sexually-related emojis. We would also need to be careful with how/when we're using emojis as keywords (could check for key emojis first and then replace).
  • training method (slow): train models to recognise various emojis under different contexts, might also be something that emerges naturally by training on lots of data containing emojis. Might work with the common use cases, but work less well with lesser used emojis. Would not work with the Bert tokenizer.
  • hybrid method where we train with emoji descriptions directly and replace them at inference time
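
The replacement method could be prototyped roughly as follows. This sketch swaps in the standard library's `unicodedata` names instead of demoji, so the descriptions are official Unicode character names rather than demoji's wording. The function name `describe_emojis` is hypothetical. It treats every character in the Unicode category "So" (Symbol, other) as an emoji, which is an approximation: that category also covers non-emoji symbols, and modifiers such as skin tones or joiners are left untouched.

```python
import unicodedata


def describe_emojis(text: str) -> str:
    """Replace symbol characters with their lowercase Unicode names."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            # Pad with spaces so the description does not fuse with neighbours.
            pieces.append(f" {name.lower()} " if name else ch)
        else:
            pieces.append(ch)
    # Collapse the extra whitespace introduced by the padding.
    return " ".join("".join(pieces).split())
```

The output could then be passed to the model in place of the raw text to check whether scores change on emoji-heavy inputs.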

To dos:

  • investigate how well the replacement method works on a dataset like Hatemoji
  • finetune Detoxify with Hatemoji train set and compare

false positive

wtf? for some reason this message is flagged as toxic:
"who selling lup pots"
can you fix? using original data set

UnicodeDecodeError when installing from git

System: Windows 10
Python version: 3.10.9
Cmd reproduction:

G:\TestProject>python3 -m venv testenv

G:\TestProject>source testenv/bin/activate
'source' is not recognized as an internal or external command,
operable program or batch file.

G:\TestProject>/testenv/scripts/activate.bat
The system cannot find the path specified.

G:\TestProject>G:\TestProject\testenv\Scripts\activate.bat

(testenv) G:\TestProject>git clone https://github.com/unitaryai/detoxify
Cloning into 'detoxify'...
remote: Enumerating objects: 885, done.
remote: Counting objects: 100% (885/885), done.
remote: Compressing objects: 100% (390/390), done.
remote: Total 885 (delta 505), reused 834 (delta 482), pack-reused 0
Receiving objects: 100% (885/885), 52.01 MiB | 17.26 MiB/s, done.
Resolving deltas: 100% (505/505), done.

(testenv) G:\TestProject>pip install -e detoxify
Obtaining file:///G:/TestProject/detoxify
  Installing build dependencies ... done
  Checking if build backend supports build_editable ... done
  Getting requirements to build editable ... error
  error: subprocess-exited-with-error

  × Getting requirements to build editable did not run successfully.
  │ exit code: 1
  ╰─> [21 lines of output]
      Traceback (most recent call last):
        File "G:\TestProject\testenv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 351, in <module>
          main()
        File "G:\TestProject\testenv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 333, in main
          json_out['return_val'] = hook(**hook_input['kwargs'])
        File "G:\TestProject\testenv\lib\site-packages\pip\_vendor\pep517\in_process\_in_process.py", line 132, in get_requires_for_build_editable
          return hook(config_settings)
        File "C:\Users\User\AppData\Local\Temp\pip-build-env-kx9fk13h\overlay\Lib\site-packages\setuptools\build_meta.py", line 447, in get_requires_for_build_editable
          return self.get_requires_for_build_wheel(config_settings)
        File "C:\Users\User\AppData\Local\Temp\pip-build-env-kx9fk13h\overlay\Lib\site-packages\setuptools\build_meta.py", line 338, in get_requires_for_build_wheel
          return self._get_build_requires(config_settings, requirements=['wheel'])
        File "C:\Users\User\AppData\Local\Temp\pip-build-env-kx9fk13h\overlay\Lib\site-packages\setuptools\build_meta.py", line 320, in _get_build_requires
          self.run_setup()
        File "C:\Users\User\AppData\Local\Temp\pip-build-env-kx9fk13h\overlay\Lib\site-packages\setuptools\build_meta.py", line 484, in run_setup
          super(_BuildMetaLegacyBackend,
        File "C:\Users\User\AppData\Local\Temp\pip-build-env-kx9fk13h\overlay\Lib\site-packages\setuptools\build_meta.py", line 335, in run_setup
          exec(code, locals())
        File "<string>", line 6, in <module>
        File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2544.0_x64__qbz5n2kfra8p0\lib\encodings\cp1250.py", line 23, in decode
          return codecs.charmap_decode(input,self.errors,decoding_table)[0]
      UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 5960: character maps to <undefined>
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.

index

What is input_text here:
"print(pd.DataFrame(results, index=input_text).round(5))"

TypeError: 'NoneType' object is not subscriptable

I am having this error while trying to load the model.

from detoxify import Detoxify

model = Detoxify('original', device="cuda")


---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [15], line 3
      1 from detoxify import Detoxify
----> 3 results = Detoxify('original').predict('some text')

File ~/.conda/envs/py/lib/python3.9/site-packages/detoxify/detoxify.py:103, in Detoxify.__init__(self, model_type, checkpoint, device, huggingface_config_path)
    101 def __init__(self, model_type="original", checkpoint=PRETRAINED_MODEL, device="cpu", huggingface_config_path=None):
    102     super().__init__()
--> 103     self.model, self.tokenizer, self.class_names = load_checkpoint(
    104         model_type=model_type,
    105         checkpoint=checkpoint,
    106         device=device,
    107         huggingface_config_path=huggingface_config_path,
    108     )
    109     self.device = device
    110     self.model.to(self.device)

File ~/.conda/envs/py/lib/python3.9/site-packages/detoxify/detoxify.py:56, in load_checkpoint(model_type, checkpoint, device, huggingface_config_path)
     50 change_names = {
     51     "toxic": "toxicity",
     52     "identity_hate": "identity_attack",
     53     "severe_toxic": "severe_toxicity",
     54 }
     55 class_names = [change_names.get(cl, cl) for cl in class_names]
---> 56 model, tokenizer = get_model_and_tokenizer(
     57     **loaded["config"]["arch"]["args"],
     58     state_dict=loaded["state_dict"],
     59     huggingface_config_path=huggingface_config_path,
     60 )
     62 return model, tokenizer, class_names

File ~/.conda/envs/py/lib/python3.9/site-packages/detoxify/detoxify.py:20, in get_model_and_tokenizer(model_type, model_name, tokenizer_name, num_classes, state_dict, huggingface_config_path)
     16 def get_model_and_tokenizer(
     17     model_type, model_name, tokenizer_name, num_classes, state_dict, huggingface_config_path=None
     18 ):
     19     model_class = getattr(transformers, model_name)
---> 20     model = model_class.from_pretrained(
     21         pretrained_model_name_or_path=None,
     22         config=huggingface_config_path or model_type,
     23         num_labels=num_classes,
     24         state_dict=state_dict,
     25         local_files_only=huggingface_config_path is not None,
     26     )
     27     tokenizer = getattr(transformers, tokenizer_name).from_pretrained(
     28         huggingface_config_path or model_type,
     29         local_files_only=huggingface_config_path is not None,
     30         # TODO: may be needed to let it work with Kaggle competition
     31         # model_max_length=512,
     32     )
     34     return model, tokenizer

File ~/.conda/envs/py/lib/python3.9/site-packages/transformers/modeling_utils.py:2379, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   2369     if dtype_orig is not None:
   2370         torch.set_default_dtype(dtype_orig)
   2372     (
   2373         model,
   2374         missing_keys,
   2375         unexpected_keys,
   2376         mismatched_keys,
   2377         offload_index,
   2378         error_msgs,
-> 2379     ) = cls._load_pretrained_model(
   2380         model,
   2381         state_dict,
   2382         loaded_state_dict_keys,  # XXX: rename?
   2383         resolved_archive_file,
   2384         pretrained_model_name_or_path,
   2385         ignore_mismatched_sizes=ignore_mismatched_sizes,
   2386         sharded_metadata=sharded_metadata,
   2387         _fast_init=_fast_init,
   2388         low_cpu_mem_usage=low_cpu_mem_usage,
   2389         device_map=device_map,
   2390         offload_folder=offload_folder,
   2391         offload_state_dict=offload_state_dict,
   2392         dtype=torch_dtype,
   2393         load_in_8bit=load_in_8bit,
   2394     )
   2396 model.is_loaded_in_8bit = load_in_8bit
   2398 # make sure token embedding weights are still tied if needed

File ~/.conda/envs/py/lib/python3.9/site-packages/transformers/modeling_utils.py:2572, in PreTrainedModel._load_pretrained_model(cls, model, state_dict, loaded_keys, resolved_archive_file, pretrained_model_name_or_path, ignore_mismatched_sizes, sharded_metadata, _fast_init, low_cpu_mem_usage, device_map, offload_folder, offload_state_dict, dtype, load_in_8bit)
   2569                 del state_dict[checkpoint_key]
   2570     return mismatched_keys
-> 2572 folder = os.path.sep.join(resolved_archive_file[0].split(os.path.sep)[:-1])
   2573 if device_map is not None and is_safetensors:
   2574     param_device_map = expand_device_map(device_map, original_loaded_keys)

TypeError: 'NoneType' object is not subscriptable

pip install information:

Collecting detoxify
  Downloading detoxify-0.5.0-py3-none-any.whl (12 kB)
Collecting transformers!=4.18.0
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
     โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 5.8/5.8 MB 75.2 MB/s eta 0:00:0000:0100:01
Collecting torch>=1.7.0
  Downloading torch-1.13.0-cp39-cp39-manylinux1_x86_64.whl (890.2 MB)
     โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 890.2/890.2 MB 3.6 MB/s eta 0:00:0000:0100:01
Collecting sentencepiece>=0.1.94
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
     โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 1.3/1.3 MB 107.1 MB/s eta 0:00:00
Collecting typing-extensions
  Downloading typing_extensions-4.4.0-py3-none-any.whl (26 kB)
Collecting nvidia-cuda-nvrtc-cu11==11.7.99
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl (21.0 MB)
     โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 21.0/21.0 MB 77.9 MB/s eta 0:00:0000:0100:01
Collecting nvidia-cublas-cu11==11.10.3.66
  Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl (317.1 MB)
     โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 317.1/317.1 MB 8.9 MB/s eta 0:00:0000:0100:01
Collecting nvidia-cuda-runtime-cu11==11.7.99
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl (849 kB)
     โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 849.3/849.3 kB 112.2 MB/s eta 0:00:00
Collecting nvidia-cudnn-cu11==8.5.0.96
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl (557.1 MB)
     โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 557.1/557.1 MB 6.0 MB/s eta 0:00:0000:0100:01
Requirement already satisfied: wheel in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch>=1.7.0->detoxify) (0.37.1)
Requirement already satisfied: setuptools in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from nvidia-cublas-cu11==11.10.3.66->torch>=1.7.0->detoxify) (63.4.1)
Collecting regex!=2019.12.17
  Downloading regex-2022.10.31-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (769 kB)
     โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 770.0/770.0 kB 116.5 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from transformers!=4.18.0->detoxify) (1.23.4)
Requirement already satisfied: tqdm>=4.27 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from transformers!=4.18.0->detoxify) (4.64.1)
Requirement already satisfied: pyyaml>=5.1 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from transformers!=4.18.0->detoxify) (6.0)
Collecting filelock
  Downloading filelock-3.8.2-py3-none-any.whl (10 kB)
Requirement already satisfied: packaging>=20.0 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from transformers!=4.18.0->detoxify) (21.3)
Requirement already satisfied: requests in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from transformers!=4.18.0->detoxify) (2.28.1)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
     โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 182.4/182.4 kB 103.0 MB/s eta 0:00:00
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
     โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ”โ” 7.6/7.6 MB 33.4 MB/s eta 0:00:0000:0100:01m
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from packaging>=20.0->transformers!=4.18.0->detoxify) (3.0.9)
Requirement already satisfied: certifi>=2017.4.17 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from requests->transformers!=4.18.0->detoxify) (2022.9.24)
Requirement already satisfied: charset-normalizer<3,>=2 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from requests->transformers!=4.18.0->detoxify) (2.1.1)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from requests->transformers!=4.18.0->detoxify) (1.26.12)
Requirement already satisfied: idna<4,>=2.5 in /home/annahaz/.conda/envs/py/lib/python3.9/site-packages (from requests->transformers!=4.18.0->detoxify) (3.4)
Installing collected packages: tokenizers, sentencepiece, typing-extensions, regex, nvidia-cuda-runtime-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cublas-cu11, filelock, nvidia-cudnn-cu11, huggingface-hub, transformers, torch, detoxify
Successfully installed detoxify-0.5.0 filelock-3.8.2 huggingface-hub-0.11.1 nvidia-cublas-cu11-11.10.3.66 nvidia-cuda-nvrtc-cu11-11.7.99 nvidia-cuda-runtime-cu11-11.7.99 nvidia-cudnn-cu11-8.5.0.96 regex-2022.10.31 sentencepiece-0.1.97 tokenizers-0.13.2 torch-1.13.0 transformers-4.25.1 typing-extensions-4.4.0

additional information
python 3.9.13 haa1d7c7_2
on linux

Question regards use case

Hi,

I am new to developing NLP models and I am thinking of using your model as an assistant when fine-tuning a chit-chat bot. I have seen in several issues that you don't recommend this. Is that true? Also, which models would you suggest using as experts, including any of your current models?

Thank you in advance,

Don't automatically use GPU

At the moment it will automatically use the GPU if cuda is available, with no way to select CPU mode or another sort of device.

self.device = "cuda" if torch.cuda.is_available() else "cpu"

Which can be unexpected and cause anything the user is running on GPU 0 to run out of memory.

Suggested fix:
Have a device argument in the Detoxify __init__ that accepts any torch device specifier and defaults to cpu

Checkpoints missing optimizer_states

Thank you for your work on this very useful library!

I have had success training Albert Unbiased from scratch. I'm curious how model performance would compare if training continued from one of your checkpoints (unbiased-albert-c8519128.ckpt in this case). However if I attempt to initiate train.py with this file I am getting an error like:

KeyError: 'Trying to restore training state but checkpoint contains only the model. This is probably due to ModelCheckpoint.save_weights_only being set to True.'

FYI I am using the following command:

python train.py --config configs/Unintended_bias_toxic_comment_classification_Albert_revised_training.json -d 1 --num_workers 0 -e 101 -r model_ckpts/unbiased-albert-c8519128_modified_state_dict.ckpt

Inspecting the checkpoint file I indeed observe it is missing some components, most critical of which (I think) is the optimizer_states. Comparing to one of my own checkpoints it looks like what is absent includes: ['pytorch-lightning_version', 'callbacks', 'optimizer_states', 'lr_schedulers', 'hparams_name', 'hyper_parameters'].

I'm wondering if I am doing something wrong? Or else, is it possible for you to share new versions of your checkpoints that include these missing components?
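
For anyone checking their own checkpoint, a small sketch of the comparison described above. The key list is copied from this issue; which of these keys are strictly needed to resume training depends on the PyTorch Lightning version, so treat the list as an assumption. The helper name `missing_resume_keys` is hypothetical.

```python
from typing import Iterable, List, Mapping

# Keys this issue reports as absent from the released checkpoint.
RESUME_KEYS = (
    "pytorch-lightning_version",
    "callbacks",
    "optimizer_states",
    "lr_schedulers",
    "hparams_name",
    "hyper_parameters",
)


def missing_resume_keys(
    checkpoint: Mapping, required: Iterable[str] = RESUME_KEYS
) -> List[str]:
    """Return the required top-level keys that the checkpoint dict lacks."""
    return [key for key in required if key not in checkpoint]
```

The `checkpoint` argument would be the dictionary returned by `torch.load(path, map_location="cpu")`. An empty result means all listed keys are present.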

Dependency error in CI testing

CI testing fails with this error:
image

TODO: Check if the --use-feature=2020-resolver flag is still needed or 2020-resolver needs to be replaced with one of fast-deps, truststore, no-binary-enable-wheel-cache

- classifier.out_proj.weight: found shape torch.Size([16, 768]) in the checkpoint and torch.Size([2, 768]) in the model instantiated

I am fine-tuning your model on a binary classification task, so the number of classes is 2 instead of 16.

My class labels are just 0 and 1

https://huggingface.co/unitary/unbiased-toxic-roberta/tree/main

I am writing the below code:

import numpy as np
import transformers as tr

# Metric: accuracy on the binary labels
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = np.sum(predictions == labels) / predictions.shape[0]
    return {"accuracy": acc}

# device, train_data and val_data are defined elsewhere
model = tr.RobertaForSequenceClassification.from_pretrained("/home/pc/unbiased_toxic_roberta", num_labels=2)
model.to(device)

training_args = tr.TrainingArguments(
    # report_to='wandb',
    output_dir='/home/pc/1_Proj_hate_speech/results_roberta',  # output directory
    overwrite_output_dir=True,
    num_train_epochs=20,             # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=32,   # batch size for evaluation
    learning_rate=2e-5,
    warmup_steps=1000,               # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs3',           # directory for storing logs
    logging_steps=1000,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

trainer = tr.Trainer(
    model=model,                      # the instantiated 🤗 Transformers model to be trained
    args=training_args,               # training arguments, defined above
    train_dataset=train_data,         # training dataset
    eval_dataset=val_data,            # evaluation dataset
    compute_metrics=compute_metrics,
)

Error:

- classifier.out_proj.weight: found shape torch.Size([16, 768]) in the checkpoint and torch.Size([2, 768]) in the model instantiated
- classifier.out_proj.bias: found shape torch.Size([16]) in the checkpoint and torch.Size([2]) in the model instantiated

How can I solve this?
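
A possible fix, offered as a sketch rather than a confirmed answer from the maintainers: `from_pretrained` in Transformers accepts `ignore_mismatched_sizes=True`, which skips checkpoint weights whose shape differs from the instantiated model and leaves those layers freshly initialised. Here that would discard the 16-way classification head and keep the rest of the encoder. The function name `load_binary_classifier` is hypothetical, and the import is deferred so the block can be read without Transformers installed.

```python
def load_binary_classifier(checkpoint_dir: str):
    """Load the RoBERTa checkpoint with a new, randomly initialised 2-class head."""
    # Deferred import: transformers is only needed when the function is called.
    import transformers as tr

    return tr.RobertaForSequenceClassification.from_pretrained(
        checkpoint_dir,
        num_labels=2,
        # Skip classifier.out_proj.* because its checkpoint shape is [16, 768].
        ignore_mismatched_sizes=True,
    )
```

The returned model could replace the `from_pretrained` call in the post above. Since the new head starts from random weights, it has to be trained before its predictions mean anything.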

Mismatched results between your lib vs huggingface

Hi team,

First of all, thank you very much for the library. Could you clarify why your results differ from Hugging Face's results for the same input? Can you please help me with this?
Thanks

RuntimeError

RuntimeError: /Users/qab/.cache/torch/checkpoints/toxic_original-c1212f89.ckpt is a zip archive (did you mean to use torch.jit.load()?)
I get this error trying to run this for the first time. Any help?

Unable to load any model.

I am unsure whether this is due to being on an M1, but it is my suspicion after having tested with various Python versions satisfying the >=3.6 requirement on PyPI. It works on my personal laptop running an Arch-based Linux distribution, using the same code and Python 3.9.


The following code is being run using Python 3.9.12.

__import__('detoxify').Detoxify('original').predict('this does not work')

Running this simple test prediction will throw the error below.

Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 309, in _check_seekable
    f.seek(f.tell())
AttributeError: 'NoneType' object has no attribute 'seek'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.9/site-packages/transformers/modeling_utils.py", line 349, in load_state_dict
    return torch.load(checkpoint_file, map_location="cpu")
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 699, in load
    with _open_file_like(f, 'rb') as opened_file:
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 236, in _open_file_like
    return _open_buffer_reader(name_or_buffer)
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 221, in __init__
    _check_seekable(buffer)
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 312, in _check_seekable
    raise_err_msg(["seek", "tell"], e)
  File "/opt/homebrew/lib/python3.9/site-packages/torch/serialization.py", line 305, in raise_err_msg
    raise type(e)(msg)
AttributeError: 'NoneType' object has no attribute 'seek'. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/homebrew/lib/python3.9/site-packages/detoxify/detoxify.py", line 93, in __init__
    self.model, self.tokenizer, self.class_names = load_checkpoint(
  File "/opt/homebrew/lib/python3.9/site-packages/detoxify/detoxify.py", line 49, in load_checkpoint
    model, tokenizer = get_model_and_tokenizer(
  File "/opt/homebrew/lib/python3.9/site-packages/detoxify/detoxify.py", line 19, in get_model_and_tokenizer
    model = model_class.from_pretrained(
  File "/opt/homebrew/lib/python3.9/site-packages/transformers/modeling_utils.py", line 1797, in from_pretrained
    state_dict = load_state_dict(resolved_archive_file)
  File "/opt/homebrew/lib/python3.9/site-packages/transformers/modeling_utils.py", line 352, in load_state_dict
    with open(checkpoint_file) as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
