skt's Introduction

Sanskrit compound segmentation using a seq2seq model

Code for the paper titled 'Building a Word Segmenter for Sanskrit Overnight'

Instructions

Pre-requisites

The following python packages are required to be installed:

File organization

  • Data is located in data/.
  • Logs generated by the TensorFlow summary writer are stored in logs/.
  • Trained models are stored in models/. Before training, make sure the folders logs/ and models/ exist.
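The folder setup described above can be scripted. The following is a minimal sketch, assuming the script is run from the repository root; the helper name ensure_output_dirs is hypothetical and not part of the repository.

```python
from pathlib import Path


def ensure_output_dirs(root=".", names=("logs", "models")):
    """Create the output folders under `root` if they do not exist yet.

    `ensure_output_dirs` is a hypothetical helper, not part of the repository.
    """
    created = []
    for name in names:
        path = Path(root) / name
        # exist_ok=True makes the call safe to repeat before every training run
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created
```

Calling ensure_output_dirs() once before running train.py should be enough; repeated calls leave existing folders untouched.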

Training

The file train.py can be used to train the model. The file test.py can be used to test the model.

Data

The data/ folder contains all the data used for the segmentation task.

All the .txt files are already tokenized using sentencepiece. The m.vocab and m.model files are the ones generated by sentencepiece, and they can be used to tokenize any other data with the same vocabulary.
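As an illustration of reusing the existing model on new text, here is a minimal sketch built on the sentencepiece Python package. It assumes the tokenized files store pieces separated by single spaces and that you pass the path of the repository's m.model file yourself; both helper names are hypothetical.

```python
def load_piece_encoder(model_path):
    """Return a function that maps a sentence to a list of sentencepiece pieces.

    Hypothetical helper; needs the `sentencepiece` package to be installed.
    """
    import sentencepiece as spm  # imported lazily so the rest works without it

    sp = spm.SentencePieceProcessor()
    sp.Load(model_path)
    return sp.EncodeAsPieces


def tokenize_file(in_path, out_path, encode):
    """Tokenize `in_path` line by line and write space-joined pieces to `out_path`.

    Joining pieces with single spaces is an assumption about the file format.
    """
    count = 0
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            pieces = encode(line.rstrip("\n"))
            dst.write(" ".join(pieces) + "\n")
            count += 1
    return count
```

A typical call would be tokenize_file("my_sentences.txt", "my_sentences_tok.txt", load_piece_encoder(path_to_m_model)), where the file names are placeholders.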

All .txt files contain one data item per line, separated by a newline (\n).

Training data:

  • dcs_data_input_train_sent.txt contains the input sentences used for training.
  • dcs_data_output_train_sent.txt contains the output words (segmented forms of the input) used for training.

Testing data:

  • dcs_data_input_test_sent.txt contains the input sentences used for testing.
  • dcs_data_output_test_sent.txt contains the output words (segmented forms of the input) used for testing.
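Because the input and output files are line-aligned, they can be read as pairs. The sketch below is one way to do that and to catch a line-count mismatch early; the function name read_parallel is hypothetical and this is not necessarily how the repository loads its data.

```python
def read_parallel(input_path, output_path):
    """Read two line-aligned files and return a list of (input, output) pairs.

    Hypothetical helper; raises ValueError if the files differ in line count.
    """
    with open(input_path, encoding="utf-8") as f:
        inputs = [line.rstrip("\n") for line in f]
    with open(output_path, encoding="utf-8") as f:
        outputs = [line.rstrip("\n") for line in f]
    if len(inputs) != len(outputs):
        raise ValueError(
            f"line count mismatch: {len(inputs)} inputs vs {len(outputs)} outputs"
        )
    return list(zip(inputs, outputs))
```

For example, read_parallel("data/dcs_data_input_test_sent.txt", "data/dcs_data_output_test_sent.txt") would return one pair per test sentence.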

When train.py is run, it creates several additional files for word2id, id2word, ...; more details are provided in utils/data_loader.py.
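For readers unfamiliar with the terms, word2id and id2word usually denote a token-to-index mapping and its inverse. The following is a generic sketch of how such mappings can be built from tokenized lines. It is not taken from utils/data_loader.py; the function name and the special tokens are assumptions made for illustration.

```python
def build_vocab(sentences, specials=("<pad>", "<unk>")):
    """Build token-to-id and id-to-token mappings from space-tokenized lines.

    Generic illustration only; `specials` is an assumed set of reserved tokens.
    """
    word2id = {}
    for token in specials:
        word2id.setdefault(token, len(word2id))
    for sentence in sentences:
        for token in sentence.split():
            # assign the next free id the first time a token is seen
            word2id.setdefault(token, len(word2id))
    id2word = {index: token for token, index in word2id.items()}
    return word2id, id2word
```

Ids are assigned in order of first appearance, so the mapping is deterministic for a fixed input file.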

Testing on other data

To test on your own data, create a file containing the sentences to be segmented, one sentence per line, and run the unsupervised segmenter (sentencepiece) on it using the m.vocab and m.model files present in the utils/ folder. Also create a second file with the ground truth outputs, again tokenized with sentencepiece. If you only want the model's output and do not intend to compare it with ground truth data, use an empty file with the same number of lines as the input data file instead.

Then modify test.py with the appropriate paths and run it.
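The placeholder ground truth file mentioned above (blank lines, one per input sentence) can be generated rather than written by hand. This is a minimal sketch; the function name write_placeholder_targets is hypothetical.

```python
def write_placeholder_targets(input_path, placeholder_path):
    """Write a file of blank lines with as many lines as `input_path`.

    Hypothetical helper for the case where no ground truth is available.
    """
    with open(input_path, encoding="utf-8") as src:
        n_lines = sum(1 for _ in src)
    with open(placeholder_path, "w", encoding="utf-8") as dst:
        dst.write("\n" * n_lines)
    return n_lines
```

The returned line count can be compared against the size of the tokenized input as a quick sanity check before running test.py.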

skt's People

Contributors

cvikasreddy
