myParaphrase

Paraphrase Dataset for Burmese (Myanmar Language)

Paraphrase detection, or semantic similarity, requires understanding a sentence as a whole rather than merely finding synonyms for its words. It is an important research area in natural language processing that plays a significant role in many applications such as question answering, summarization, information retrieval, and information extraction. To the best of our knowledge, no prior studies have been conducted on paraphrase detection and classification for Burmese (Myanmar language). We ran paraphrase classification experiments based on both traditional machine learning methods and deep Siamese neural networks. A further contribution is a human-annotated corpus of Burmese paraphrase and non-paraphrase sentence pairs, containing 40,461 sentence pairs plus open-test data of 1,000 sentence pairs.

Introduction in Burmese (Myanmar Language), translated to English

This corpus was built for experimenting with models that let a computer tell whether two sentences that use different words, known as paraphrases, have the same meaning when each is taken as a whole sentence. It should be useful for Burmese NLP research. My Ph.D. student Myint Myint Htay and I spent more than two years preparing it. It contains more than 40,400 sentences in total. We also prepared a separate set of 1,000 sentences as open-test data. We intend to keep extending the corpus as circumstances allow.

Version Information

Version 1.0 Release Date: 3 December 2022

Data Format Example

CSV header is as follows:

"id","pid1","pid2","paraphrase1","paraphrase2","is_paraphrase"

Some examples of tagged paraphrase sentences are as follows:

(base) yekyaw.thu@gpu:~/exp/siamese/myParaphrase/corpus/ver1.0/csv-qqp$ shuf train-qqp.csv | head
"19830","19831","19832","ဒီလို မကြာမကြာ အတူ စား ကြရအောင် ။","ငါတို့ မင်း ကို ကျေးဇူးတင် တယ် ။","0"
"24566","24567","24568","မင်း ဘယ် ရထား စီးလာ မှာလဲ ။","ကျွန်မက သက်သက်လွတ်သမားရှင့် ။","0"
"28755","28756","28757","ရေအေးအေးလေး က အမောပြေ စေ တယ် ။","ရေအေးအေးလေး က အမော ကို ပြေ သွား တာ ပဲ ။","1"
"23088","23089","23090","ဘယ်သူ့ ကို သံသယဖြစ် တာ လဲ ။","ကဗျာ ရေး သလား ။","0"
"28697","28698","28699","ရေချိုးခန်း လည်း ပါ သည် ။","နည်းလမ်း ရှာ တာ လား ။","0"
"14700","14701","14702","တံခါးပေါက် ဆီ အလုအယက် ထွက် နေ တဲ့ လူ အုပ် က ကျွန်တော့် ကို နင်း မိ တော့ မ လို ဖြစ် သွား တယ် ။","တံခါးပေါက် ဆီ အလုအယက် ထွက် နေ တဲ့ လူ အုပ် နဲ့အတူ ကျွန်တော့် မနည်း ပြေး လိုက် ရ တယ် ။","0"
"16027","16028","16029","တော် ပါ တယ် ကြိုးစား ပါ","တော် လိုက် တာ ကြိုးစား နော် အားမလျှော့ နဲ့","1"
"22766","22767","22768","ဘယ် လို ပဲ ဒုက္ခ တွေ အနည်းနည်းအဖုံဖုံ လာ ပါစေ သတ္တိ ရှိ ပါ ။","ဘယ် လို ပဲ ပျော် စရာ တွေ အမျိုးမျိုး လာ ပါစေ တည်ငြိမ် ပါ ။","0"
"26023","26024","26025","မနက်ဖြန် နေ့ခင်း ပွဲ ဆို ရင် ကော ဘယ်လို လဲ ။","မနက်ဖြန် နေ့ခင်း ပွဲ ကြည့် ကြ မယ် ။","0"
"24002","24003","24004","မင်း ကို သူ လက်စားချေ မှာ မဟုတ်ပါ ဘူး ။","သူမ အဲ့ဒါ ကို ထပ် စဉ်းစား ခဲ့ ပါ တယ် ။","0"
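
As a minimal sketch, rows in this format can be read with Python's standard `csv` module. The two sample rows are copied from the examples above; `parse_pairs` and `FIELDS` are hypothetical names, and the header check is an assumption, since the shuffled sample does not show whether the file itself starts with the header line.

```python
import csv
import io

# Column names taken from the CSV header shown above.
FIELDS = ["id", "pid1", "pid2", "paraphrase1", "paraphrase2", "is_paraphrase"]

# Two rows copied from the sample output above.
SAMPLE = '''"16027","16028","16029","တော် ပါ တယ် ကြိုးစား ပါ","တော် လိုက် တာ ကြိုးစား နော် အားမလျှော့ နဲ့","1"
"23088","23089","23090","ဘယ်သူ့ ကို သံသယဖြစ် တာ လဲ ။","ကဗျာ ရေး သလား ။","0"
'''

def parse_pairs(handle):
    """Yield one dict per data row, with the label converted to int."""
    reader = csv.DictReader(handle, fieldnames=FIELDS)
    for row in reader:
        if row["id"] == "id":
            # Assumption: skip a header row if the file contains one.
            continue
        row["is_paraphrase"] = int(row["is_paraphrase"])
        yield row

pairs = list(parse_pairs(io.StringIO(SAMPLE)))
for pair in pairs:
    # The sentences are already word-segmented with spaces.
    print(pair["id"], pair["is_paraphrase"], pair["paraphrase1"].split())
```

For a real file, the same function could be given `open("train-qqp.csv", encoding="utf-8", newline="")` instead of the in-memory sample.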

Contributors

Myint Myint Htay
Ye Kyaw Thu

Experimental Setting for Demo Running

We used only the training data for the demo run. To build the three Siamese neural network models, we used https://github.com/tlatkowski/multihead-siamese-nets.
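
For readers new to the idea, the sketch below illustrates what a Siamese setup does in general: both sentences pass through the same encoder, a similarity score is computed from the two encodings, and the score is compared with the 0/1 label. Everything here is an illustrative assumption and not the code of the repository above: the averaging encoder, the toy embedding table `EMB`, and the choice of an exponentiated negative Manhattan distance as the similarity are stand-ins, and the RNN, CNN, and Transformer encoders used in the experiments are far richer.

```python
import math

def encode(tokens, embeddings):
    """Shared encoder (toy stand-in): average the token embedding vectors."""
    dim = len(next(iter(embeddings.values())))
    vectors = [embeddings.get(tok, [0.0] * dim) for tok in tokens]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def manhattan_similarity(u, v):
    """Map the L1 distance between two encodings into the range (0, 1]."""
    return math.exp(-sum(abs(a - b) for a, b in zip(u, v)))

def mse_loss(predictions, labels):
    """Mean squared error between similarity scores and 0/1 labels."""
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

# Hypothetical 2-dimensional embeddings for three placeholder tokens.
EMB = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}

def predict(sentence1, sentence2):
    """Score a pair with the SAME encoder applied to both sides."""
    return manhattan_similarity(
        encode(sentence1.split(), EMB), encode(sentence2.split(), EMB)
    )

same = predict("a b", "b a")  # identical encodings, so the score is 1.0
diff = predict("a a", "b b")  # L1 distance 2, so the score is exp(-2)
loss = mse_loss([same, diff], [1, 0])
print(same, diff, loss)
```

The key design point is weight sharing: because one encoder handles both inputs, the score does not depend on which sentence is listed first.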

Some important hyperparameters for all three models are as follows:

[TRAINING]
num_epochs = 10
batch_size = 512
eval_every = 20
learning_rate = 0.001
checkpoints_to_keep = 5
save_every = 100
log_device_placement = False

[DATA]
logs_path = logs
model_dir = model_dir

[PARAMS]
embedding_size = 64
loss_function = MSE
char_embeddings = False

Specific hyperparameters for RNN:

[PARAMS]
hidden_size = 128
cell_type = GRU
bidirectional = True

Specific hyperparameters for CNN:

[PARAMS]
num_filters = 50,50,50
filter_sizes = 2,3,4
dropout_rate = 0.0

Specific hyperparameters for Transformer:

[PARAMS]
num_blocks = 2
num_heads = 8
use_residual = False
dropout_rate = 0.0
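
The settings above are in INI format, so they can be inspected with Python's standard `configparser`. The sketch below merges the common settings with the CNN-specific `[PARAMS]` values; the split into two strings and the helper name `load_config` are assumptions made for illustration, and this is not necessarily how the training code itself loads its configuration.

```python
import configparser

# Common settings, copied from the listing above.
COMMON = """
[TRAINING]
num_epochs = 10
batch_size = 512
eval_every = 20
learning_rate = 0.001
checkpoints_to_keep = 5
save_every = 100
log_device_placement = False

[DATA]
logs_path = logs
model_dir = model_dir

[PARAMS]
embedding_size = 64
loss_function = MSE
char_embeddings = False
"""

# CNN-specific settings, copied from the listing above.
CNN = """
[PARAMS]
num_filters = 50,50,50
filter_sizes = 2,3,4
dropout_rate = 0.0
"""

def load_config(*texts):
    """Read several INI fragments into one parser; later ones extend earlier ones."""
    parser = configparser.ConfigParser()
    for text in texts:
        parser.read_string(text)
    return parser

cfg = load_config(COMMON, CNN)
batch_size = cfg.getint("TRAINING", "batch_size")
learning_rate = cfg.getfloat("TRAINING", "learning_rate")
char_embeddings = cfg.getboolean("PARAMS", "char_embeddings")
# Comma-separated values need to be split by hand.
filter_sizes = [int(s) for s in cfg.get("PARAMS", "filter_sizes").split(",")]
print(batch_size, learning_rate, char_embeddings, filter_sizes)
```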

Experimental Results

Table.1 Evaluation results on RNN-Siamese, CNN-Siamese and Transformer-Siamese with myParaphrase corpus (version 1.0)

Model        Mean-Dev-Accuracy  Last-Dev-Accuracy  Test-Accuracy  Training/Validation Time
bi-RNN       0.84               0.87               0.85           2m2.830s
CNN          0.88               0.89               0.88           0m33.637s
Transformer  0.81               0.82               0.81           1m38.253s

Accuracy and loss graphs for 3 Siamese models

Fig.1 Accuracy and loss comparison graphs for three Siamese models (RNN-Siamese, CNN-Siamese and Transformer-Siamese) with myParaphrase corpus (version 1.0)

Accuracy graphs for 3 Siamese models

Fig.2 Accuracy details for three Siamese models (RNN-Siamese, CNN-Siamese and Transformer-Siamese) with myParaphrase corpus (version 1.0)

Citation

If you use the myParaphrase corpus (version 1.0) in your research, we would appreciate it if you cite the following reference:

Myint Myint Htay, Ye Kyaw Thu, Hnin Aye Thant, Thepchai Supnithi, "Deep Siamese Neural Network Vs Random Forest for Myanmar Language Paraphrase Classification", Journal of Intelligent Informatics and Smart Technology, Oct 2nd Issue, 2022, pp. 25-1 to 25-9. (Submitted Feb 21, 2022; accepted July 17, 2022; published on 31 Oct 2022) [Paper]

If you use the three Siamese models that we trained with myParaphrase (version 1.0), please cite the following link:

https://github.com/ye-kyaw-thu/myParaphrase

Reference

We did paraphrase classification experiments with "multihead-siamese-nets":

[1] https://github.com/tlatkowski/multihead-siamese-nets

Some runtime errors were resolved with the help of the following:

[2] https://stackoverflow.com/questions/55318626/module-tensorflow-has-no-attribute-logging
[3] https://stackoverflow.com/questions/61102281/dataframe-object-has-no-attribute-as-matrix

Papers consulted:

[4]

@inproceedings{NIPS2017_3f5ee243,
 author = {Vaswani, Ashish and Shazeer, Noam and Parmar, Niki and Uszkoreit, Jakob and Jones, Llion and Gomez, Aidan N and Kaiser, \L ukasz and Polosukhin, Illia},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {I. Guyon and U. Von Luxburg and S. Bengio and H. Wallach and R. Fergus and S. Vishwanathan and R. Garnett},
 pages = {},
 publisher = {Curran Associates, Inc.},
 title = {Attention is All you Need},
 url = {https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf},
 volume = {30},
 year = {2017}
}

[5]

@inproceedings{ranasinghe-etal-2019-semantic,
    title = "Semantic Textual Similarity with {S}iamese Neural Networks",
    author = "Ranasinghe, Tharindu  and
      Orasan, Constantin  and
      Mitkov, Ruslan",
    booktitle = "Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)",
    month = sep,
    year = "2019",
    address = "Varna, Bulgaria",
    publisher = "INCOMA Ltd.",
    url = "https://aclanthology.org/R19-1116",
    doi = "10.26615/978-954-452-056-4_116",
    pages = "1004--1011",
    abstract = "Calculating the Semantic Textual Similarity (STS) is an important research area in natural language processing which plays a significant role in many applications such as question answering, document summarisation, information retrieval and information extraction. This paper evaluates Siamese recurrent architectures, a special type of neural networks, which are used here to measure STS. Several variants of the architecture are compared with existing methods",
}

[6]

@inproceedings{Koch2015SiameseNN,
  title={Siamese Neural Networks for One-Shot Image Recognition},
  author={Gregory R. Koch},
  year={2015}
}

To Do

  • to update the myParaphrase corpus
  • to study longer Burmese sentences and paragraph-level text

myparaphrase's Issues

Missing value in CSV, TSV, etc

Hi,

773,"ကျနော် စောစော ပြန် တော့ ခင်ဗျား စိတ်ဆိုး မှာ လား",""

  • second paraphrase sentence is missing

773,ကျနော် စောစော ပြန် တော့ ခင်ဗျား စိတ်ဆိုး မှာ လား,,ကျနော် စောစော ပြန် လျှင် ခင်ဗျား စိတ်ဆိုး မှာ ပါ လား

  • second paraphrase sentence is placed as label

I suggest removing this instance (id = 773) from the dataset, since it is in the test split.
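
A check along the following lines could flag such rows before training or evaluation. It is only a sketch: the four-column layout (id, first sentence, second sentence, label) is inferred from the second line quoted above, `find_bad_rows` is a hypothetical helper, and the well-formed row in the sample reuses an example pair from the README purely for illustration.

```python
import csv
import io

# First row: a well-formed pair (illustrative). Second row: the line quoted in this issue.
SAMPLE = (
    "28755,ရေအေးအေးလေး က အမောပြေ စေ တယ် ။,ရေအေးအေးလေး က အမော ကို ပြေ သွား တာ ပဲ ။,1\n"
    "773,ကျနော် စောစော ပြန် တော့ ခင်ဗျား စိတ်ဆိုး မှာ လား,,ကျနော် စောစော ပြန် လျှင် ခင်ဗျား စိတ်ဆိုး မှာ ပါ လား\n"
)

def find_bad_rows(handle):
    """Return (id, reason) for rows with a missing sentence or an invalid label."""
    bad = []
    for row in csv.reader(handle):
        if len(row) != 4:
            bad.append((row[0] if row else "", "wrong column count"))
            continue
        row_id, sentence1, sentence2, label = row
        if not sentence1.strip() or not sentence2.strip():
            bad.append((row_id, "empty sentence"))
        elif label not in ("0", "1"):
            bad.append((row_id, "invalid label"))
    return bad

bad = find_bad_rows(io.StringIO(SAMPLE))
print(bad)
```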
