portuguese's Introduction

I have implemented a model for restoring capitalization and diacritics to plain Portuguese text.

My data comes from the March 31, 2014 Portuguese Wikipedia dump. I have included the original text and the preprocessed versions in /data.

I have included the Python preprocessing scripts in /preprocess. Each script reads from stdin and writes to stdout. These scripts depend on the unidecode module.
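The following is a minimal sketch of the kind of stream filter described above, not one of the scripts in /preprocess. It uses the standard library's unicodedata module in place of unidecode so that it has no third-party dependency, and the names strip_text and filter_stream are hypothetical.

```python
import io
import unicodedata


def strip_text(text):
    """Return text lowercased with its diacritics removed."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (grave, acute, circumflex, tilde, cedilla, ...).
    base_only = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base_only.lower()


def filter_stream(src, dst):
    """Copy src to dst line by line, stripping capitalization and diacritics."""
    for line in src:
        dst.write(strip_text(line))


# Example with in-memory streams; passing sys.stdin and sys.stdout instead
# would give the stdin-to-stdout behavior described above.
out = io.StringIO()
filter_stream(io.StringIO("Ação É Útil\n"), out)
plain = out.getvalue()
```

Unlike unidecode, this approach only removes combining marks and does not transliterate characters that lack an ASCII decomposition.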

I have included my Weka models and output result buffers in /models and /results respectively.

I have included general-purpose scripts in /scripts. Within this folder, confusion.sh turns Weka's text-based confusion matrices into csv files for use with Excel. start.sh adds the appropriate Weka sources to the user's CLASSPATH (the user will need to modify the WEKA_PATH variable in the script).

Afterwards, a user can use convert.sh to restore capitalization and diacritics to a text, or to see what the classifier would have predicted for a plain-text version of it. Input is read from stdin and output is printed to stdout. This pipeline uses the J48 N=3 classifier. It is neither efficient nor robust, but it can be used to demonstrate the classifier's basic capabilities.

To run:

    . scripts/start.sh
    sh scripts/convert.sh < your_file_name

Diacritic Key For .arff Files

accent   lower   upper
none     (0)     (6)
`        (1)     (7)
´        (2)     (8)
^        (3)     (9)
~        (4)     (10)
ç        (5)     (11)
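
As an illustration of the key above, the sketch below maps a single character to its class number. It assumes that each character is labeled independently and that the upper-case codes are the lower-case codes plus 6, as the table suggests. The names MARK_TO_CODE, UPPER_OFFSET, and diacritic_class are hypothetical and do not come from the repository.

```python
import unicodedata

# Unicode combining marks mapped to the lower-case codes in the key.
MARK_TO_CODE = {
    "\u0300": 1,  # grave
    "\u0301": 2,  # acute
    "\u0302": 3,  # circumflex
    "\u0303": 4,  # tilde
    "\u0327": 5,  # cedilla
}
UPPER_OFFSET = 6  # per the key, upper-case codes are lower-case codes plus 6


def diacritic_class(ch):
    """Return the key's class number for a single character."""
    # NFD yields the base letter followed by any combining marks.
    decomposed = unicodedata.normalize("NFD", ch)
    code = 0  # "none" unless a known mark is found
    for mark in decomposed[1:]:
        if mark in MARK_TO_CODE:
            code = MARK_TO_CODE[mark]
            break
    if decomposed[0].isupper():
        code += UPPER_OFFSET
    return code


labels = [diacritic_class(ch) for ch in "Ação"]
```

The function expects exactly one character. How the .arff files treat digits, punctuation, and whitespace is not stated in the key, so this sketch simply assigns them class 0.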
