rifkybujana / indobert-qa

indoBERT Base-Uncased fine-tuned on Translated Squad v2.0

License: Apache License 2.0

indobert-qa's Introduction

This project is part of my research, which won a silver medal at KoPSI (Kompetisi Penelitian Siswa Indonesia, the Indonesian Student Research Competition). The research is entitled "Teman Belajar: Asisten Digital Pelajar SMAN 28 Jakarta dalam Membaca".

indoBERT Base-Uncased fine-tuned on Translated Squad v2.0

This is the IndoBERT model trained by IndoLEM, fine-tuned on a translated version of SQuAD 2.0 for the Q&A downstream task.

Model size (after training): 420 MB

Details of indoBERT (from their documentation)

IndoBERT is the Indonesian version of the BERT model. It was trained on over 220M words, aggregated from three main sources:

  • Indonesian Wikipedia (74M words)
  • news articles from Kompas, Tempo (Tala et al., 2003), and Liputan6 (55M words in total)
  • an Indonesian Web Corpus (Medved and Suchomel, 2017) (90M words).

The model was trained for 2.4M steps (180 epochs), with a final perplexity of 3.97 on the development set (similar to English BERT-base). This IndoBERT was used to examine IndoLEM, an Indonesian benchmark that comprises seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse.
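As a rough illustration of what the perplexity figure means, the sketch below assumes perplexity is defined as the exponential of the mean per-token negative log-likelihood. The function name and the sample log-probabilities are hypothetical and are not taken from the IndoBERT training code.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over the given tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical example: a model that assigns probability 0.25 to every
# target token has a perplexity of about 4, close to the reported 3.97.
sample_log_probs = [math.log(0.25)] * 4
print(perplexity(sample_log_probs))
```

Under this definition, a perplexity of 3.97 corresponds to a mean negative log-likelihood of roughly 1.38 nats per token.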

Details of the downstream task (Q&A) - Dataset

SQuAD2.0 combines the 100,000 questions in SQuAD1.1 with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones. To do well on SQuAD2.0, systems must not only answer questions when possible, but also determine when no answer is supported by the paragraph and abstain from answering.

Dataset     Split    # samples
SQuAD2.0    train    130k
SQuAD2.0    eval     12.3k
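The abstention requirement described above is often handled by comparing the score of the best answer span against a "no answer" score. The sketch below shows that decision rule only; it is an assumption about a common approach, not a description of how this repository implements it, and the function name, scores, and threshold are hypothetical.

```python
from typing import Optional


def choose_answer(
    best_span_text: str,
    best_span_score: float,
    null_score: float,
    threshold: float = 0.0,
) -> Optional[str]:
    """Return the best span, or None when the no-answer score wins.

    The model abstains when the no-answer score exceeds the best span
    score by more than `threshold` (a value normally tuned on a dev set).
    """
    if null_score - best_span_score > threshold:
        return None
    return best_span_text


# Hypothetical scores: the span clearly beats the no-answer option.
print(choose_answer("11 November 1785", best_span_score=12.4, null_score=1.0))
# Hypothetical scores: the no-answer option wins, so the system abstains.
print(choose_answer("1785", best_span_score=2.0, null_score=5.0))
```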

Model Training

The model was trained on a machine with a Tesla T4 GPU and 12 GB of RAM.

Results:

Metric    Value
EM        51.61
F1        69.09
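For readers unfamiliar with these metrics, the following is a minimal sketch of SQuAD-style exact match (EM) and token-level F1 for a single prediction. It is a simplified illustration, not the evaluation script used for the numbers above: the official SQuAD script also strips English articles, which is omitted here on the assumption that it is not meaningful for Indonesian text.

```python
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, remove ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, truth: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(truth))


def f1_score(prediction: str, truth: str) -> float:
    """Harmonic mean of token precision and recall."""
    pred_tokens = normalize_answer(prediction).split()
    truth_tokens = normalize_answer(truth).split()
    # For unanswerable questions both sides are empty, which counts as correct.
    if not pred_tokens or not truth_tokens:
        return float(pred_tokens == truth_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("11 November 1785", "11 november 1785"))
print(f1_score("1785", "11 November 1785"))
```

A partial answer such as "1785" scores 0 on EM but still earns partial F1 credit, which is why F1 is higher than EM in the results.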

Pipeline Usage

from pipeline import Pipeline

pipeline = Pipeline("model")
pipeline.predict(
    "Pangeran Harya Dipanegara (atau biasa dikenal dengan nama Pangeran Diponegoro, lahir di Ngayogyakarta Hadiningrat, 11 November 1785 – meninggal di Makassar, Hindia Belanda, 8 Januari 1855 pada umur 69 tahun) adalah salah seorang pahlawan nasional Republik Indonesia, yang memimpin Perang Diponegoro atau Perang Jawa selama periode tahun 1825 hingga 1830 melawan pemerintah Hindia Belanda. Sejarah mencatat, Perang Diponegoro atau Perang Jawa dikenal sebagai perang yang menelan korban terbanyak dalam sejarah Indonesia, yakni 8.000 korban serdadu Hindia Belanda, 7.000 pribumi, dan 200 ribu orang Jawa serta kerugian materi 25 juta Gulden.",
    "kapan pangeran diponegoro lahir?"
)

output:

{
    "best answer": OrderedDict([(0, '11 November 1785')]),
    "answers": [{'score': 12.36208, 'text': '11 November 1785'}, {'score': 9.136721, 'text': '11 November 1785 - meninggal di Makassar, Hindia Belanda, 8 Januari 1855'}, {'score': 8.018387, 'text': '1785'}, {'score': 6.1863337, 'text': 'Ngayogyakarta Hadiningrat, 11 November 1785'}, {'score': 6.091961, 'text': '11 November 1785 -'}, {'score': 5.8137712, 'text': '11 November 178'}, {'score': 5.579988, 'text': '11 November'}, {'score': 5.423601, 'text': '11 November 1785 - meninggal di Makassar, Hindia Belanda, 8 Januari 1855 pada umur 69 tahun'}, ...]
}

Simple Usage (Using Huggingface)

from transformers import pipeline
qa_pipeline = pipeline(
    "question-answering",
    model="Rifky/Indobert-QA",
    tokenizer="Rifky/Indobert-QA"
)
qa_pipeline({
    'context': """Pangeran Harya Dipanegara (atau biasa dikenal dengan nama Pangeran Diponegoro, lahir di Ngayogyakarta Hadiningrat, 11 November 1785 – meninggal di Makassar, Hindia Belanda, 8 Januari 1855 pada umur 69 tahun) adalah salah seorang pahlawan nasional Republik Indonesia, yang memimpin Perang Diponegoro atau Perang Jawa selama periode tahun 1825 hingga 1830 melawan pemerintah Hindia Belanda. Sejarah mencatat, Perang Diponegoro atau Perang Jawa dikenal sebagai perang yang menelan korban terbanyak dalam sejarah Indonesia, yakni 8.000 korban serdadu Hindia Belanda, 7.000 pribumi, dan 200 ribu orang Jawa serta kerugian materi 25 juta Gulden.""",
    'question': "kapan pangeran diponegoro lahir?"
})

output:

{
  'answer': '11 November 1785',
  'end': 131,
  'score': 0.9272009134292603,
  'start': 115
}
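The 'start' and 'end' fields are character offsets into the context string, with 'end' exclusive. The short check below illustrates this using the output shown above and only the beginning of the context (truncated here for brevity); no model download is needed to run it.

```python
# Beginning of the context from the example above, cut off right after
# the answer span so the snippet stays short.
context_prefix = (
    "Pangeran Harya Dipanegara (atau biasa dikenal dengan nama Pangeran "
    "Diponegoro, lahir di Ngayogyakarta Hadiningrat, 11 November 1785"
)

# Output dictionary copied from the example above.
result = {
    "answer": "11 November 1785",
    "end": 131,
    "score": 0.9272009134292603,
    "start": 115,
}

# Slicing the context with the offsets recovers the answer text.
span = context_prefix[result["start"]:result["end"]]
print(span == result["answer"])
```

These offsets are useful for highlighting the answer inside the original passage.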

Reference

Fajri Koto, Afshin Rahimi, Jey Han Lau, and Timothy Baldwin. 2020. IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP. In Proceedings of the 28th International Conference on Computational Linguistics (COLING).
