An application implementing a few incremental segmentation algorithms


This program (creatively named 'seg') implements a set of
incremental algorithms for learning segmentation, described in my PhD
thesis and in some later work. The thesis can be
found at <http://dissertations.ub.rug.nl/faculties/arts/2011/c.coltekin/>.

The latest version of this application can be obtained from
<https://bitbucket.org/coltekin/seg>.

Besides a standard C compiler and 'make', you need:

- GLib -- http://developer.gnome.org/glib/
- GNU Scientific Library (GSL) -- www.gnu.org/s/gsl/

The command-line options are managed using gengetopt
<http://www.gnu.org/software/gengetopt/>. The generated cmdline.[ch]
files are included in the distribution, so you do not need to install
gengetopt unless you want to change the command-line interface.

Typing 'make' should build the executable 'seg'. Here are a few
example runs:

- Segment using only the predictability cue, with the default
  context size:
```
    ./seg -i data/br-phono.txt \
          -m combine \
          --cues=pred \
          --pred-m=mi
```
- The same, but discard the output and print precision/recall/F-score:
```
    ./seg -i data/br-phono.txt \
          -o /dev/null \
          -m combine \
          --cues=pred \
          --pred-m=mi \
          --print-prf \
          --print-head
```
- Combine the predictability cue (measures mi, h, and rh) with
  utterance boundaries, using the default context options. Again,
  discard the output, and print a LaTeX tabular instead of
  comma-separated values:
```
    ./seg -i data/br-phono.txt \
          -o /dev/null \
          -m combine \
          --cues=pred,ub \
          --pred-m=mi \
          --print-prf \
          --print-head \
          --print-latex
```
- Do not segment, but print the PMI value for every possible boundary
  location:
```
    ./seg -i data/br-phono.txt \
          --print \
          --pred-m=mi
```
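
To make the PMI output in the last example more concrete, here is a minimal Python sketch of pointwise mutual information between adjacent symbols. It is not taken from the `seg` sources: the function names `pmi_table` and `boundary_scores` are hypothetical, the input is assumed to be a list of unsegmented strings with one symbol per character, and the probabilities are estimated in a single batch pass rather than incrementally as the algorithms in the thesis do.

```python
import math
from collections import Counter


def pmi_table(utterances):
    """Estimate PMI(x, y) = log2(P(x, y) / (P(x) * P(y))) for adjacent symbols.

    Illustrative batch estimate; `seg` itself learns incrementally.
    """
    unigrams = Counter()
    bigrams = Counter()
    for utt in utterances:
        unigrams.update(utt)               # count single symbols
        bigrams.update(zip(utt, utt[1:]))  # count adjacent symbol pairs
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table


def boundary_scores(utterance, table):
    """Return one PMI value per candidate boundary, i.e. per adjacent pair.

    Pairs never seen in the training data get -inf (zero estimated
    joint probability); this is a choice made for the sketch.
    """
    return [table.get(pair, float("-inf"))
            for pair in zip(utterance, utterance[1:])]


if __name__ == "__main__":
    table = pmi_table(["abab", "ab"])
    print(boundary_scores("aba", table))
```

In predictability-based segmentation, a low value between two symbols is usually read as evidence for a word boundary at that position. How `seg` thresholds these values or combines them with other cues is not shown in this sketch.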

See the output of `seg -h` for more information on the usage.

The code is well tested and should work on any POSIX-like
environment, but it may not be easy to digest, as it reuses some
code from earlier projects. The command-line options may be confusing
and are not always well documented. I plan to improve the readability
and usability of the software while working on a few future projects I
have in mind.

This software can be used, modified, and distributed under the terms of
the GNU General Public License, version 3 or later. The licenses and
terms of the included corpora may differ from the license of the
application. See the README file(s) in the data/ directory for more
information.

Questions, comments or corrections are welcome at [email protected]

If you use this application for your research, please cite the
relevant publication(s) from the list below:

- Çağrı Çöltekin (2011). "Catching Words in a Stream of Speech:
  Computational simulations of segmenting transcribed child-directed
  speech." PhD thesis. University of Groningen.
- Çağrı Çöltekin and John Nerbonne (2014). "An explicit statistical
  model of learning lexical segmentation using multiple cues." In:
  Workshop on Cognitive Aspects of Computational Language Learning,
  EACL 2014.

For proper attribution to the data distributed here, please see the
README files under the data/ directory.
