
Linguistically Motivated Neural Machine Translation

The official repo for the tutorial "Linguistically Motivated Neural Machine Translation" in The 25th Annual Conference of the European Association for Machine Translation (EAMT 2024).


Slides

Part 1: Introduction (slides)
Part 2: Linguistic Features and Encoder (slides)
Part 3: Subword, Decoder, Evaluation (slides)

Presenters

Haiyue Song is a technical researcher at the Advanced Translation Technology Laboratory, National Institute of Information and Communications Technology (NICT), Japan. He obtained his Ph.D. from Kyoto University. His research interests include machine translation, large language models, subword segmentation, and decoding algorithms. He has MT- and LLM-related publications in TALLIP, AACL, LREC, ACL, and EMNLP.
Hour Kaing is a researcher at the Advanced Translation Technology Laboratory, National Institute of Information and Communications Technology (NICT), Japan. He received his B.S. from the Institute of Technology of Cambodia, his M.Sc. from the University of Grenoble 1, France, and his Ph.D. from the Nara Institute of Science and Technology, Japan. He is interested in linguistic analysis, low-resource machine translation, language modeling, and speech processing. He has publications in TALLIP, EACL, PACLIC, LREC, and IWSLT.
Raj Dabre is a senior researcher at the Advanced Translation Technology Laboratory, National Institute of Information and Communications Technology (NICT), Japan, and an adjunct faculty member at IIT Madras, India. He received his Ph.D. from Kyoto University and his Master's from IIT Bombay. His primary interests are low-resource NLP, language modeling, and efficiency. He has published in ACL, EMNLP, NAACL, TMLR, AAAI, AACL, IJCNLP, and CSUR.

Programme

Date: June 27, 2024 (Thursday), 9:00 AM - 12:30 PM

Time          Session
9:00-9:20     Introduction
9:20-10:20    Augmenting NMT Architectures with Linguistic Features
10:20-10:50   Coffee break
10:50-11:20   Linguistically Motivated Tokenization and Transfer Learning
11:20-11:40   Linguistically Aware Decoding
11:40-12:00   Linguistically Motivated Evaluation
12:00-12:15   Conclusions
12:15-12:30   Q&A

Introduction

The tutorial focuses on incorporating linguistics into different stages of the neural machine translation (NMT) pipeline, from pre-processing to model training to evaluation.

Tutorial Overview

Relevance to the MT Community

For machine translation (MT), purely data-driven approaches have dominated in recent years, while approaches based on linguistic knowledge are often neglected. This tutorial highlights the importance of linguistic knowledge, especially for low-resource languages where training data is limited.

Outline

  1. Introduction to Neural Machine Translation
  2. Linguistically Motivated Tokenization and Transfer Learning
  3. Augmenting NMT Architectures with Linguistic Features
  4. Linguistically Aware Decoding
  5. Linguistically Motivated Evaluation
  6. Limitations and Future Directions
  7. Summary and Conclusion
  8. Discussion and Q/A

📖 Reading List

1. Introduction to Neural Machine Translation

  1. Neural Machine Translation: Basics, Practical Aspects and Recent Trends - Dabre et al., 2017
  2. Attention is All you Need - Vaswani et al., 2017
  3. Neural Machine Translation by Jointly Learning to Align and Translate - Bahdanau et al., 2016

2. Linguistically Motivated Tokenization and Transfer Learning

  1. Juman++: A Morphological Analysis Toolkit for Scriptio Continua - Tolmachev et al., 2018
  2. Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation - He et al., 2020
  3. BERTSeg: BERT Based Unsupervised Subword Segmentation for Neural Machine Translation - Song et al., 2022
  4. MorphyNet: a Large Multilingual Database of Derivational and Inflectional Morphology - Batsuren et al., 2021
  5. Linguistically Motivated Vocabulary Reduction for Neural Machine Translation from Turkish to English - Ataman et al., 2017
  6. Linguistically Motivated Subwords for English-Tamil Translation: University of Groningen’s Submission to WMT-2020 - Dhar et al., 2020
  7. Neural Machine Translation of Logographic Languages Using Sub-character Level Information - Zhang and Komachi, 2018
  8. On Romanization for Model Transfer Between Scripts in Neural Machine Translation - Amrhein and Sennrich, 2020
  9. RomanSetu: Efficiently unlocking multilingual capabilities of Large Language Models via Romanization - Husain et al., 2024
  10. CharSpan: Utilizing Lexical Similarity to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages - Maurya et al., 2024
  11. SelectNoise: Unsupervised Noise Injection to Enable Zero-Shot Machine Translation for Extremely Low-resource Languages - Brahma et al., 2023
  12. Pre-training via Leveraging Assisting Languages for Neural Machine Translation - Song et al., 2020
  13. IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages - Gala et al., 2023
  14. IndicBART: A Pre-trained Model for Natural Language Generation of Indic Languages - Dabre et al., 2021
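
As a concrete illustration of the tokenization side of this section (not taken from the tutorial materials), the sketch below trains a standard unigram SentencePiece model on text that has been pre-segmented by a morphological analyzer, so that learned subwords tend to follow morpheme boundaries. The corpus paths and the morph_segment placeholder are hypothetical; a real setup would call an analyzer such as Juman++.

# Minimal sketch: linguistically motivated subword segmentation.
# Assumption: morph_segment stands in for a real morphological
# analyzer (e.g. Juman++ for Japanese); here it is a no-op placeholder.
import sentencepiece as spm

def morph_segment(sentence: str) -> str:
    """Hypothetical morphological pre-segmentation (placeholder)."""
    # A real analyzer would insert spaces at morpheme boundaries.
    return sentence

# 1) Pre-segment the training corpus so that SentencePiece treats
#    morpheme boundaries as whitespace boundaries.
with open("train.raw", encoding="utf-8") as fin, \
     open("train.morph", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(morph_segment(line.strip()) + "\n")

# 2) Train a unigram subword model on the pre-segmented text.
spm.SentencePieceTrainer.train(
    input="train.morph", model_prefix="morph_unigram",
    vocab_size=8000, model_type="unigram",
)

# 3) Apply the model: subwords now tend to align with morphemes.
sp = spm.SentencePieceProcessor(model_file="morph_unigram.model")
print(sp.encode(morph_segment("an example sentence"), out_type=str))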

3. Augmenting NMT Architectures with Linguistic Features

  1. Linguistic Input Features Improve Neural Machine Translation - Sennrich and Haddow, 2016
  2. FeatureBART: Feature Based Sequence-to-Sequence Pre-Training for Low-Resource NMT - Chakrabarty et al., 2022
  3. Improving Low-Resource NMT through Relevance Based Linguistic Features Incorporation - Chakrabarty et al., 2020
  4. Low-Resource Multilingual Neural Translation Using Linguistic Feature Based Relevance Mechanisms - Chakrabarty et al., 2023
  5. Exploiting Linguistic Resources for Neural Machine Translation Using Multi-task Learning - Niehues and Cho, 2017
  6. Syntax-Enhanced Neural Machine Translation with Syntax-Aware Word Representations - Zhang et al., 2019
  7. Dependency-to-Dependency Neural Machine Translation - Wu et al., 2018
  8. Multi-Source Syntactic Neural Machine Translation - Currey and Heafield, 2018
  9. Incorporating Source Syntax into Transformer-Based Neural Machine Translation - Currey and Heafield, 2019
  10. Enhancing Machine Translation with Dependency-Aware Self-Attention - Bugliarello and Okazaki, 2020
  11. Passing Parser Uncertainty to the Transformer: Labeled Dependency Distributions for Neural Machine Translation - Pu and Sima'an, 2022
  12. Modeling Source Syntax for Neural Machine Translation - Li et al., 2017
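
To make the idea of linguistic input features concrete, here is a minimal PyTorch sketch in the spirit of Sennrich and Haddow (2016): each source token carries an extra factor (here a POS tag), and the word and factor embeddings are concatenated before entering the encoder. All dimensions and vocabulary sizes are illustrative, not taken from any of the papers above.

# Minimal sketch of factored source embeddings (word + POS tag),
# in the spirit of Sennrich and Haddow (2016). Sizes are illustrative.
import torch
import torch.nn as nn

class FactoredEmbedding(nn.Module):
    def __init__(self, vocab_size=32000, pos_size=32,
                 word_dim=480, pos_dim=32):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.pos_emb = nn.Embedding(pos_size, pos_dim)
        # The concatenated vector (480 + 32 = 512) feeds the encoder.

    def forward(self, word_ids, pos_ids):
        # word_ids, pos_ids: (batch, src_len) integer tensors
        return torch.cat([self.word_emb(word_ids),
                          self.pos_emb(pos_ids)], dim=-1)

emb = FactoredEmbedding()
words = torch.randint(0, 32000, (2, 7))  # toy batch of token ids
tags = torch.randint(0, 32, (2, 7))      # toy batch of POS-tag ids
print(emb(words, tags).shape)            # torch.Size([2, 7, 512])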

4. Linguistically Aware Decoding

  1. Tree-to-Sequence Attentional Neural Machine Translation - Eriguchi et al., 2016
  2. Improved Neural Machine Translation with a Syntax-Aware Encoder and Decoder - Chen et al., 2017
  3. Learning to Parse and Translate Improves Neural Machine Translation - Eriguchi et al., 2017
  4. Sequence-to-Dependency Neural Machine Translation - Wu et al., 2017
  5. A Tree-based Decoder for Neural Machine Translation - Wang et al., 2018
  6. Towards String-To-Tree Neural Machine Translation - Aharoni and Goldberg, 2017
  7. Predicting Target Language CCG Supertags Improves Neural Machine Translation - Nădejde et al., 2017
  8. Improving Neural Machine Translation with Soft Template Prediction - Yang et al., 2020
  9. Explicit Syntactic Guidance for Neural Text Generation - Li et al., 2023
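
Several of the papers above attach a syntactic prediction task to the decoder (e.g. learning to parse and translate, or predicting target-side CCG supertags). The hedged sketch below shows only the shared objective: a translation cross-entropy plus a weighted auxiliary cross-entropy over syntactic labels, with toy tensors standing in for real model outputs.

# Minimal sketch of a multi-task objective: translation loss plus an
# auxiliary syntactic loss (e.g. target-side supertags or dependency
# labels predicted from the same decoder states). Tensors are toy data.
import torch
import torch.nn.functional as F

def multitask_loss(mt_logits, tgt_ids, syn_logits, syn_ids,
                   aux_weight=0.5, pad_id=0):
    """Weighted sum of translation and syntactic cross-entropy losses."""
    mt_loss = F.cross_entropy(mt_logits.flatten(0, 1), tgt_ids.flatten(),
                              ignore_index=pad_id)
    syn_loss = F.cross_entropy(syn_logits.flatten(0, 1), syn_ids.flatten(),
                               ignore_index=pad_id)
    # aux_weight trades off translation accuracy against syntactic accuracy.
    return mt_loss + aux_weight * syn_loss

# Toy example: batch of 2 sentences, length 5, vocab 100, 20 syntax labels.
mt_logits = torch.randn(2, 5, 100)   # decoder outputs over the target vocab
syn_logits = torch.randn(2, 5, 20)   # same states projected to syntax labels
tgt_ids = torch.randint(1, 100, (2, 5))
syn_ids = torch.randint(1, 20, (2, 5))
print(multitask_loss(mt_logits, tgt_ids, syn_logits, syn_ids))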

5. Linguistically Motivated Evaluation

  1. Linguistic Evaluation for the 2021 State-of-the-art Machine Translation Systems for German to English and English to German - Macketanz et al., 2021
  2. Linguistically Motivated Evaluation of Machine Translation Metrics Based on a Challenge Set - Avramidis and Macketanz, 2022
  3. Linguistically Motivated Evaluation of the 2023 State-of-the-art Machine Translation: Can ChatGPT Outperform NMT? - Manakhimova et al., 2023
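
To give a flavour of challenge-set evaluation in the style of the papers above, here is a small illustrative sketch (the test items and patterns are invented, not drawn from any published test suite): each item pairs a source sentence with regular expressions describing acceptable realizations of a linguistic phenomenon, and accuracy is reported per phenomenon.

# Minimal sketch of challenge-set evaluation: each item checks whether
# the MT output realizes a linguistic phenomenon, via regex patterns.
# The items and system outputs below are invented toy examples.
import re
from collections import defaultdict

test_items = [
    {"phenomenon": "verb_tense",
     "source": "Sie hatte das Buch gelesen.",
     "accept": [r"\bhad read\b"]},
    {"phenomenon": "negation",
     "source": "Er kommt nicht.",
     "accept": [r"\bnot\b", r"\bisn't\b"]},
]

def evaluate(outputs):
    """Per-phenomenon accuracy: an item passes if any pattern matches."""
    correct, total = defaultdict(int), defaultdict(int)
    for item, hyp in zip(test_items, outputs):
        total[item["phenomenon"]] += 1
        if any(re.search(p, hyp, re.IGNORECASE) for p in item["accept"]):
            correct[item["phenomenon"]] += 1
    return {p: correct[p] / total[p] for p in total}

print(evaluate(["She had read the book.", "He does not come."]))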

Authors

Haiyue Song, Hour Kaing, Raj Dabre
National Institute of Information and Communications Technology (NICT)
Hikaridai 3-5, Seika-cho, Soraku-gun, Kyoto, Japan

Emails:

Citation (bib)

@inproceedings{linguistic-mt24,
  title={Linguistically Motivated Neural Machine Translation},
  author={Song, Haiyue and Kaing, Hour and Dabre, Raj},
  booktitle={The 25th Annual Conference of the European Association for Machine Translation (EAMT 2024)},
  year={2024}
}
