TPU-Multilingual-Toxic-Comment-Classification

Data collected from Kaggle:

The primary data for the competition is the comment_text column in each provided file. It contains the text of a comment that has been classified as toxic or non-toxic (0 to 1 in the toxic column). The training set's comments are entirely in English and come from either Civil Comments or Wikipedia talk page edits. The test data's comment_text column contains comments in multiple non-English languages.
The *-train.csv files and validation.csv file also contain a toxic column that is the target to be trained on.
The jigsaw-toxic-comment-train.csv and jigsaw-unintended-bias-train.csv contain training data (comment_text and toxic) from the two previous Jigsaw competitions, as well as additional columns that you may find useful.

jigsaw-toxic-comment-train.csv - data from our first competition. The dataset is made up of English comments from Wikipedia’s talk page edits.
jigsaw-unintended-bias-train.csv - data from our second competition. This is an expanded version of the Civil Comments dataset with a range of additional labels.
sample_submission.csv - a sample submission file in the correct format
test.csv - comments from Wikipedia talk pages in different non-English languages.
validation.csv - comments from Wikipedia talk pages in different non-English languages.
jigsaw-toxic-comment-train-processed-seqlen128.csv - training data preprocessed for BERT
jigsaw-unintended-bias-train-processed-seqlen128.csv - training data preprocessed for BERT
validation-processed-seqlen128.csv - validation data preprocessed for BERT
test-processed-seqlen128.csv - test data preprocessed for BERT
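
As a rough illustration of how the two labelled training files described above might be read, the following sketch uses only the Python standard library. The function names, the `data_dir` argument, and the 0.5 threshold are assumptions made for this example, not part of the competition data or this repository. The threshold is only there in case the toxic column holds a fractional score rather than a strict 0 or 1.

```python
import csv
from pathlib import Path

# File names as listed above; the directory they live in is an assumption.
TRAIN_FILES = (
    "jigsaw-toxic-comment-train.csv",
    "jigsaw-unintended-bias-train.csv",
)


def load_examples(csv_path, threshold=0.5):
    """Return (comment_text, label) pairs from one labelled CSV file.

    The 0.5 threshold is an assumption for turning a fractional `toxic`
    score into a binary label; it is not taken from the competition rules.
    """
    examples = []
    # newline="" lets the csv module handle line breaks inside quoted comments.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            label = 1 if float(row["toxic"]) >= threshold else 0
            examples.append((row["comment_text"], label))
    return examples


def load_training_set(data_dir, threshold=0.5):
    """Concatenate the examples from both *-train.csv files in data_dir."""
    examples = []
    for name in TRAIN_FILES:
        examples.extend(load_examples(Path(data_dir) / name, threshold))
    return examples
```

The same `load_examples` helper could be pointed at validation.csv, since that file also carries a toxic column. test.csv has no target column, so it would need a separate reader.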
