GithubHelp home page GithubHelp logo

yparvej / fullstop-deep-punctuation-prediction Goto Github PK

View Code? Open in Web Editor NEW

This project forked from oliverguhr/fullstop-deep-punctuation-prediction

0.0 0.0 0.0 95.43 MB

A model that predicts the punctuation of English, Italian, French and German texts.

Home Page: https://huggingface.co/oliverguhr/fullstop-punctuation-multilang-large

License: MIT License

Shell 8.17% Python 91.83%

fullstop-deep-punctuation-prediction's Introduction

FullStop: Multilingual Deep Models for Punctuation Prediction

This work describes our contribution to the SEPP-NLG Shared Task in multilingual sentence segmentation and punctuation prediction. The goal of this task consists in training NLP models that can predict the end of sentence (EOS) and punctuation marks on automatically generated or transcribed texts. We show that these tasks benefit from crosslingual transfer by successfully employing multilingual deep language models. Our multilingual model achieves an average F1-score of 0.94 for EOS prediction on English, German, French, and Italian texts and an average F1-score of 0.78 for punctuation mark prediction.

fullstop-demo.mp4

How to use:

If you like to use the model on your own texts, you can use our python package.

Install

To get started install the package from pypi:

pip install deepmultilingualpunctuation

Make sure your system has cmake installed, it's a prerequisite.

Usage

The PunctuationModel class an process texts of any length. Note that processing of very long texts can be time consuming.

Restore Punctuation

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
result = model.restore_punctuation(text)
print(result)

output

My name is Clara and I live in Berkeley, California. Ist das eine Frage, Frau Müller?

Predict Labels

from deepmultilingualpunctuation import PunctuationModel

model = PunctuationModel()
text = "My name is Clara and I live in Berkeley California Ist das eine Frage Frau Müller"
clean_text = model.preprocess(text)
labled_words = model.predict(clean_text)
print(labled_words)

output

[['My', '0', 0.9999887], ['name', '0', 0.99998665], ['is', '0', 0.9998579], ['Clara', '0', 0.6752215], ['and', '0', 0.99990904], ['I', '0', 0.9999877], ['live', '0', 0.9999839], ['in', '0', 0.9999515], ['Berkeley', ',', 0.99800044], ['California', '.', 0.99534047], ['Ist', '0', 0.99998784], ['das', '0', 0.99999154], ['eine', '0', 0.9999918], ['Frage', ',', 0.99622655], ['Frau', '0', 0.9999889], ['Müller', '?', 0.99863917]]

Results

The performance differs for the single punctuation markers as hyphens and colons, in many cases, are optional and can be substituted by either a comma or a full stop. The model achieves the following F1 scores for the different languages:

Label EN DE FR IT
0 0.991 0.997 0.992 0.989
. 0.948 0.961 0.945 0.942
? 0.890 0.893 0.871 0.832
, 0.819 0.945 0.831 0.798
: 0.575 0.652 0.620 0.588
- 0.425 0.435 0.431 0.421
macro average 0.775 0.814 0.782 0.762

Available Models

Languages Model
English, Italian, French and German oliverguhr/fullstop-punctuation-multilang-large
English, Italian, French, German and Dutch oliverguhr/fullstop-punctuation-multilingual-base
Dutch oliverguhr/fullstop-dutch-punctuation-prediction
Catalan softcatala/fullstop-catalan-punctuation-prediction

You can use different models by setting the model parameter:

model = PunctuationModel(model = "oliverguhr/fullstop-dutch-punctuation-prediction")

Training other languages

See training other languages

How to reproduce

In order to reproduce our experiments you can execute following code:

# setup
pip install -r requirements.txt
sh download-dataset.sh

# training
python model_test_suite.py 1
python model_test_suite.py 2

# test
sh predict_all_task_1_models.sh
sh predict_all_task_2_models.sh

Competition Website

https://sites.google.com/view/sentence-segmentation/

Data

https://drive.switch.ch/index.php/s/g3fMhMZU2uo32mf

Cite us

@article{guhr-EtAl:2021:fullstop,
  title={FullStop: Multilingual Deep Models for Punctuation Prediction},
  author    = {Guhr, Oliver  and  Schumann, Anne-Kathrin  and  Bahrmann, Frank  and  Böhme, Hans Joachim},
  booktitle      = {Proceedings of the Swiss Text Analytics Conference 2021},
  month          = {June},
  year           = {2021},
  address        = {Winterthur, Switzerland},
  publisher      = {CEUR Workshop Proceedings},  
  url       = {http://ceur-ws.org/Vol-2957/sepp_paper4.pdf}
}

fullstop-deep-punctuation-prediction's People

Contributors

jordimas avatar oliverguhr avatar zevaverbach avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.