formality_multi_domain_nmt

Formality-augmented Machine Translation

Introduction

Motivation

This project modifies existing machine translation architectures so that they can classify sentences by formality and produce translations in the correct formality class.

Problem description

We chose Japanese and Korean as our primary target languages because both mark formality in their morphology. In addition to basic machine translation, our model performs two tasks:

  • Identify the formality class of the input sentence;

  • Emit sentences in the target language with the correct formality class.

Training data

Please contact the owners of this repository for details of, or access to, the data used in this project.

Formality labels

Bilingual corpora annotated with sentence formality are virtually non-existent. We therefore devised several ways to generate formality labels from the corpora that were available to us.

Procedural generation of Japanese formality labels

We adapted earlier work that classifies the formality of a Japanese sentence according to its final verb. Because Japanese is an SOV language, the final verb is the outermost verb in any embedded verb phrase structure, and therefore also the main verb of the sentence. The classifier script can be found in formality_classification.py.

Two examples of sentence formality classifications:

  • 同情してただけなんだ: informal

  • 別に驚くことではないですよね: formal

For more details on rule-based formality classification, please refer to earlier work by Weston Feely, Eva Hasler and Adrià de Gispert. Their original repository can be found at https://github.com/wfeely/japanese-verb-conjugator.
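To give a feel for the idea, here is a minimal sketch of a final-verb heuristic. It is not the code in formality_classification.py: the function name, the list of polite endings, and the set of sentence-final particles are all assumptions made for illustration. A real classifier would use morphological analysis rather than string suffixes.

```python
import re

# Assumption: a small, hand-picked list of polite (teineigo) sentence endings.
POLITE_ENDINGS = (
    "です", "ます", "でした", "ました",
    "ません", "でしょう", "ましょう", "ください",
)

# Trailing punctuation, whitespace and common sentence-final particles
# (also an assumed, incomplete list) that are stripped before the check.
TRAILING = re.compile(r"[。．.!！?？…\sよねかなわぞさ]+$")


def classify_formality(sentence: str) -> str:
    """Return "formal" if the sentence ends in a polite form, else "informal"."""
    core = TRAILING.sub("", sentence.strip())
    return "formal" if core.endswith(POLITE_ENDINGS) else "informal"


if __name__ == "__main__":
    for s in ("同情してただけなんだ", "別に驚くことではないですよね"):
        print(s, classify_formality(s))
```

On the two example sentences above, this heuristic agrees with the listed labels. It only looks at the surface ending, so it will miss honorific vocabulary and can misjudge sentences whose last word happens to end in one of the stripped characters.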

Using a pre-trained Japanese language model for formality classification

We adapted the pre-trained Japanese BERT (credit to Tohoku University, available on Hugging Face) to output a politeness label for an input sentence. The script can be found at src/pre-trained.py.
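As a rough sketch of what such an adaptation can look like with the Hugging Face transformers library, the snippet below puts a two-way classification head on a Japanese BERT checkpoint. It is not the contents of src/pre-trained.py. The checkpoint ID, the label order, and all function names are assumptions; the exact model used in this project is not stated here.

```python
import math

# Assumption: one of the Tohoku University Japanese BERT checkpoints on
# Hugging Face; the project may have used a different one.
MODEL_NAME = "cl-tohoku/bert-base-japanese"
LABELS = ("informal", "formal")  # hypothetical label order


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_to_label(logits):
    """Map raw classifier logits to a (label, probability) pair."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


def load_classifier(model_name=MODEL_NAME):
    """Load the tokenizer and BERT with a new 2-way classification head.

    The head is randomly initialised, so the model has to be fine-tuned on
    formality-labelled sentences before its predictions are meaningful.
    """
    # Imported lazily so the pure-Python helpers above work without them.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name, num_labels=len(LABELS)
    )
    return tokenizer, model


def predict(sentence, tokenizer, model):
    """Classify one sentence with a (fine-tuned) model."""
    import torch

    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return logits_to_label(logits)
```

Calling load_classifier() downloads the checkpoint, and the Japanese tokenizer typically needs extra MeCab-related packages (for example fugashi) to be installed.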

Model Architecture

The base translation model is an autoregressive transformer. The model implementation can be found at src/model.py.

Training and evaluation scripts for our model can be found in the src directory.
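"Autoregressive" means the decoder emits one target token at a time, each conditioned on the source and on the tokens produced so far. The sketch below shows that loop with greedy selection. It is a generic illustration, not the code in src/model.py: greedy_decode, toy_scorer, and the special token strings are hypothetical, and the toy scorer merely stands in for the transformer.

```python
from typing import Callable, Dict, List, Sequence

BOS, EOS = "<s>", "</s>"  # assumed special tokens

# A scorer maps (source tokens, target prefix) to a score per candidate token.
Scorer = Callable[[Sequence[str], Sequence[str]], Dict[str, float]]


def greedy_decode(source: Sequence[str], score_next: Scorer,
                  max_len: int = 50) -> List[str]:
    """Generate target tokens one at a time, always taking the best-scoring one."""
    prefix = [BOS]
    for _ in range(max_len):
        scores = score_next(source, prefix)
        token = max(scores, key=scores.get)
        if token == EOS:
            break
        prefix.append(token)
    return prefix[1:]  # drop the BOS marker


def toy_scorer(source: Sequence[str], prefix: Sequence[str]) -> Dict[str, float]:
    """Hypothetical stand-in for the model: copies the source, then stops."""
    position = len(prefix) - 1  # number of target tokens emitted so far
    target = source[position] if position < len(source) else EOS
    return {target: 1.0, "<unk>": 0.0}


if __name__ == "__main__":
    print(greedy_decode(["a", "b"], toy_scorer))  # ['a', 'b']
```

In a real system score_next would run the transformer decoder over the prefix, and beam search is a common alternative to the greedy choice shown here.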

Results and bibliography

Please refer to our report at https://hstehstehste.github.io/Projects/multi_formality_domain.html for details on our experiment design, test results, and complete bibliography.

Contributors

rlp49, flyingsledge, hstehstehste
