
ziweiji / nusa-crowd


This project forked from indonlp/nusa-crowd


A collaborative project to collect datasets in Indonesian languages.

License: Apache License 2.0

Shell 0.03% Python 27.88% Makefile 0.03% Jupyter Notebook 72.05%

nusa-crowd's Introduction

Welcome to NusaCrowd!

132 datasets registered


Read this README in Bahasa Indonesia.

Indonesian NLP is underrepresented in the research community, and one of the reasons is the lack of access to public datasets (Aji et al., 2022). To address this issue, we have initiated NusaCrowd, a joint collaboration to collect NLP datasets for Indonesian languages. Help us collect and centralize Indonesian NLP datasets, and become a co-author of our upcoming paper.

How to contribute?

You can contribute by proposing an NLP dataset that is not yet registered in our records. Just fill out this form, and we will check and approve your entry.

We will give contribution points based on several factors, including dataset quality, language scarcity, and task scarcity.

You can also propose datasets from your past work that have not been released to the public. In that case, you must first make your dataset open by uploading it publicly, e.g. via GitHub or Google Drive.

You can submit multiple entries, and once your total contribution points exceed the threshold, we will include you as a co-author (generally, proposing 1-2 datasets is enough). Read the full method of calculating points here.

Note: We are not taking any ownership of the submitted dataset. See FAQ below.

Any other way to help?

Yes! Aside from collecting new datasets, we are also centralizing existing datasets in a single schema that makes it easier for researchers to use Indonesian NLP datasets. You can help us there by building a dataset loader. More details about that here.

Alternatively, we are also listing NLP research papers on Indonesian languages whose datasets have not yet been released. We will contact the authors of these papers later to invite them to be involved in NusaCrowd. More about this is available in our Slack server.

FAQs

Who will be the owner of the submitted dataset?

NusaCrowd does not clone or copy the submitted dataset, so ownership of any submitted dataset remains with the original author. NusaCrowd simply builds a dataloader, i.e. a file downloader plus reader, to simplify and standardize the data reading process. We also collect and centralize only the metadata of the submitted dataset, which we list in our catalogue to make your dataset easier to discover. Citations to the original data owner are provided both in NusaCrowd and in our catalogue.
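To make the "file downloader plus reader" idea concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the repository's actual loader interface: the function names, the unified `id`/`text`/`label` schema, and the source column names `sentence` and `sentiment` are all hypothetical.

```python
import csv
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download(url: str, cache_dir: Path) -> Path:
    """Fetch the file from its original host into a local cache.

    The data is only cached on the user's machine; nothing is re-hosted.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / Path(urlparse(url).path).name
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target


def read_examples(path: Path):
    """Yield rows in one shared schema (hypothetical): id, text, label."""
    with path.open(newline="", encoding="utf-8") as f:
        for idx, row in enumerate(csv.DictReader(f)):
            # "sentence" and "sentiment" are assumed column names of the
            # source file; a real loader would map its own columns here.
            yield {"id": str(idx), "text": row["sentence"], "label": row["sentiment"]}


def load_dataset(url: str, cache_dir: Path = Path("cache")) -> list:
    """Downloader + reader combined: the whole job of a dataloader."""
    return list(read_examples(download(url, cache_dir)))
```

Calling `load_dataset` with the URL where the original author hosts the file would return a list of dictionaries in the shared schema. The dataset itself stays with its owner.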

How can I find the appropriate license for my dataset?

The license for a dataset is not always obvious. Here are some strategies to try in your search:

  • check for files such as README or LICENSE that may be distributed with the dataset itself
  • check the dataset webpage
  • check publications that announce the release of the dataset
  • check the website of the organization providing the dataset

If no official license is listed anywhere, but you find a webpage that describes general data usage policies for the dataset, you can fall back to providing that URL in the _LICENSE variable. If you can't find any license information, please note this in your PR and put _LICENSE="Unknown" in your dataset script.

What if my dataset is not yet publicly available?

You can upload your dataset publicly first, e.g. on GitHub.

Can I create a PR if I have an idea?

If you have an idea to improve or change the code of the nusa-crowd repository, please create an issue and ask for feedback before starting any PRs.

I am confused, can you help me?

Yes, you can ask for help in NusaCrowd's community channels! Please join our WhatsApp group or Slack server.

Thank you!

We greatly appreciate your help!

The artifacts of this hackathon will be described in a forthcoming academic paper targeting a machine learning or NLP audience. Please refer to this section for your contribution rewards for helping Nusantara NLP. We recognize that some datasets require more effort than others, so please reach out if you have questions. Our goal is to be inclusive with credit!

nusa-crowd's People

Contributors

2112akmal, acul3, afaji, bryanwilie, christianwbsn, devsalman, faridlazuarda, fatyanosa, fhudi, fozziethebeat, gentaiscool, holylovenia, iamfinethanksu, ilhamfp, ivanhalimp, jen-santoso, kaustubhdhole, muhsatrio, munggok, rayendito, rifkiaputri, ryanignatius, samuelcahyawijaya, tysonyu, wenliangdai, yana-xuyan
