dougtrajano / olid-br

Offensive Language Identification Dataset for Brazilian Portuguese.

Home Page: https://dougtrajano.github.io/olid-br/

License: Apache License 2.0

Languages: Jupyter Notebook 98.63%, Python 1.37%
Topics: dataset, python, toxicity, toxicity-classification

OLID-BR

Quality Gate Status Python 3.10

Offensive Language Identification Dataset for Brazilian Portuguese (OLID-BR) is a collection of Portuguese text with annotations for several NLP tasks related to toxicity/offensive language.

See the Dataset documentation for more information.

Technical details

This repository contains the source code to prepare, build, and publish the OLID-BR dataset.

The repository is structured as follows:

  • /docs contains the documentation for the dataset (available at https://dougtrajano.github.io/olid-br/).
  • /notebooks/baselines contains notebooks for baseline models.
  • /notebooks/collecting contains notebooks for data collection.
  • /notebooks/exploring contains notebooks for data exploration.
  • /notebooks/processing contains notebooks for data processing.
  • /properties contains the properties for the dataset.
  • /src contains the source code for the dataset.
  • /tests contains the tests for the dataset.

Architecture

Running Notebooks

Define the following environment variables before running the notebooks. Variables marked Required must be set; the others fall back to the listed default.

Environment Variables

| Variable | Description | Default | Required |
|---|---|---|---|
| AWS_ACCESS_KEY_ID | AWS Access Key ID | None | Optional |
| AWS_S3_BUCKET_PREFIX | AWS S3 Bucket Prefix | None | Required |
| AWS_S3_BUCKET | AWS S3 Bucket | None | Required |
| AWS_SECRET_ACCESS_KEY | AWS Secret Access Key | None | Optional |
| FILTER_TOXIC_COMMENTS | Filter Toxic Comments | True | Optional |
| HUGGINGFACE_HUB_TOKEN | HuggingFace Hub Token | None | Required |
| KAGGLE_KEY | Kaggle Key | None | Required |
| KAGGLE_USERNAME | Kaggle Username | None | Required |
| LOG_LEVEL | Log level | INFO | Optional |
| PERSPECTIVE_API_KEY | Perspective API Key | None | Required |
| PERSPECTIVE_THRESHOLD | Perspective Threshold | 0.5 | Optional |
| TWITTER_ACCESS_TOKEN | Twitter Access Token | None | Required |
| TWITTER_ACCESS_TOKEN_SECRET | Twitter Access Token Secret | None | Required |
| TWITTER_CONSUMER_KEY | Twitter Consumer Key | None | Required |
| TWITTER_CONSUMER_SECRET | Twitter Consumer Secret | None | Required |
| TWITTER_MAX_TWEETS | Twitter Max Tweets or replies | None | Required |
| YOUTUBE_API_KEY | YouTube API Key | None | Required |
| YOUTUBE_MAX_COMMENTS_PER_VIDEO | YouTube Max Comments per video | None | Optional |

The Jupyter notebooks read these environment variables from a .env file.
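
It can help to check the variables listed above before a notebook starts. The following is a minimal sketch, not the repository's own loader: the names `REQUIRED`, `DEFAULTS`, and `resolve_settings` are hypothetical, and the required/optional split and the default values are copied from the list of environment variables above. The function accepts any mapping, so it can be pointed at `os.environ` once the .env file has been loaded by whatever mechanism the notebooks use.

```python
import os
from typing import Dict, Mapping

# Variables marked "Required" in the list above.
REQUIRED = (
    "AWS_S3_BUCKET_PREFIX",
    "AWS_S3_BUCKET",
    "HUGGINGFACE_HUB_TOKEN",
    "KAGGLE_KEY",
    "KAGGLE_USERNAME",
    "PERSPECTIVE_API_KEY",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_MAX_TWEETS",
    "YOUTUBE_API_KEY",
)

# Optional variables whose listed default is not None.
# Optional variables with a None default are left out of this sketch.
DEFAULTS = {
    "FILTER_TOXIC_COMMENTS": "True",
    "LOG_LEVEL": "INFO",
    "PERSPECTIVE_THRESHOLD": "0.5",
}


def resolve_settings(env: Mapping[str, str]) -> Dict[str, str]:
    """Return required values plus defaults; raise if a required one is missing."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    settings = {name: env[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        settings[name] = env.get(name, default)
    return settings


if __name__ == "__main__":
    try:
        settings = resolve_settings(os.environ)
        print("LOG_LEVEL =", settings["LOG_LEVEL"])
    except ValueError as exc:
        print(exc)
```

Failing early with the full list of missing names avoids discovering a missing key halfway through a long collection run.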

If you are running the notebooks on Google Colab, you need to run the following commands:

!git clone https://github.com/DougTrajano/olid-br.git
!mv olid-br/* .
!rm -rf olid-br
!pip install -r requirements.txt

Google Colab uses Python 3.7, so the numpy, pandas, and scikit-learn versions pinned in requirements.txt are not compatible with it. Before running pip install, update requirements.txt to the following versions:

numpy~=1.21.6
pandas~=1.3.5
scikit-learn~=1.0.2
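
The requirements.txt update described above can also be scripted in a Colab cell instead of edited by hand. This is a minimal sketch under stated assumptions: `apply_pins` is a hypothetical helper, not part of the repository, and the "original" versions in the demo are made up for illustration. It rewrites only the lines whose package name appears in the `pins` dictionary and leaves comments, blank lines, and other packages untouched.

```python
import re
from typing import Dict


def apply_pins(requirements_text: str, pins: Dict[str, str]) -> str:
    """Replace the version specifier of each package named in `pins`."""
    out = []
    for line in requirements_text.splitlines():
        # The package name is the leading run of name characters;
        # comment lines and blank lines do not match and are kept as-is.
        match = re.match(r"\s*([A-Za-z0-9_.\-]+)", line)
        name = match.group(1).lower() if match else None
        if name in pins:
            out.append(f"{name}{pins[name]}")
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Illustrative input only; the real file has different contents.
    original = "pandas==1.4.3\nscikit-learn==1.1.1\ntqdm\n"
    pins = {"pandas": "~=1.3.5", "scikit-learn": "~=1.0.2"}
    print(apply_pins(original, pins))
```

To use it on the real file, read requirements.txt into a string, pass it through `apply_pins` with the pins listed above, and write the result back before running pip install.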

Install dependencies

You can install the dependencies by running the following command:

pip install -r requirements.txt

Changelog

See the GitHub Releases page for a history of notable changes to this project.

License

The source code is licensed under the Apache 2.0 License.

The dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
